4,412 datasets found

  • Ukrainian-English parallel corpus MaCoCu-uk-en 1.0

    The Ukrainian-English parallel corpus MaCoCu-uk-en 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other...
  • Catalan-English parallel corpus MaCoCu-ca-en 1.0

    The Catalan-English parallel corpus MaCoCu-ca-en 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the...
  • Greek-English parallel corpus MaCoCu-el-en 1.0

    The Greek-English parallel corpus MaCoCu-el-en 1.0 was built by crawling the ".gr", ".ελ", ".cy" and ".eu" internet top-level domains in 2023, extending the crawl dynamically to...
  • Carniolan Provincial Assembly corpus Kranjska 1.0

    The corpus contains meeting proceedings of the Carniolan Provincial Assembly from 1861 to 1913 (Obravnave deželnega zbora kranjskega / Bericht über die Verhandlungen des...
  • Monitor corpus of Slovene Trendi 2023-02

    The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 different media websites, published by 72 different publishers. Trendi 2023-02 covers the period from...
  • Corpus of Serbian Forms of Address 1.1

    The corpus consists of transcripts of audio-recorded biographical interviews with 19 participants. The interviews are about forms of address that speakers use in colloquial and...
  • Corpus of combined Slovenian corpora metaFida 1.0

    Slovenia has a large number of diverse corpora available for online analysis via the CLARIN.SI concordancers. However, if users are interested in the same queries across...
  • Corpus of scientific texts from the Open Science Slovenia portal OSS 1.0

    OSS is a large collection of scientific writing in the Slovenian language gathered from the Open Science Slovenia portal (https://openscience.si). It consists of over 150...
  • Spoken corpus Gos 2.0 (transcriptions)

    The spoken corpus Gos 2.0 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand...
  • Corpus of Slovene linguistic scientific writing JezKor

    JezKor is a collection of linguistic scientific writing in the Slovenian language. It consists of 43 monographs published between 2009 and 2022 by the Fran Ramovš Institute of...
  • Slovene learner corpus KOST 1.0

    The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 6,311 texts (just over 1 million words) written by adult speakers for whom...
  • Serbian Web Corpus PDRS 1.0

    PDRS 1.0 is a web corpus based on crawling the .rs domain. Crawling was done in September and October 2022 with BootCat. As search terms, approx. 2,800 word forms with a...
  • Map task corpus of heritage BCMS 1.0

    The Map task corpus of heritage Bosnian/Croatian/Montenegrin/Serbian (BCMS) consists of elicited conversations (map tasks) by 29 second-generation BCMS speakers originating from...
  • Slovenian parliamentary corpus (1990-2022) siParl 3.0

    The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of...
  • Training corpus SUK 1.0

    The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...
  • Corpus of combined Slovenian corpora MetaFida 0.1

    Slovenia has a large number of diverse corpora available for online analysis via the CLARIN.SI concordancers. However, if users are interested in the same queries across...
  • CMC training corpus Janes-Norm 3.0

    Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,...
  • CMC training corpus Janes-Tag 3.0

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,...
  • Developmental corpus Šolar 3.0

    The Developmental corpus Šolar consists of 5,485 texts written by students in Slovenian secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (13-15),...
  • Corpus of 1968 Slovenian literature Maj68 2.0

    The Maj68 corpus contains 1,521 texts by 198 known authors published between 1964 and 1972 in the periodicals "Tribuna", "Problemi" and "Problemi. Literatura." The texts contain...