Dataset - B2FIND

Nimeüksuste korpus Estonian NER corpus

Corpus containing morphologically analyzed articles with named entity annotations (persons, organizations, locations) in BIO format.

Eesti murdekorpus Estonian Dialect Corpus

More info at https://murre.ut.ee/ The dialect corpus consists of: 1) Dialect recordings. The corpus is based on dialect recordings which have mainly been made in the...

Eesti keele spontaanse kõne foneetiline korpus v.1.0.0 Phonetic Corpus of Es...

The aim of the corpus is to compile a large number of quality recordings of spontaneous Estonian and segment them phonetically on different levels. The project started in autumn...

Segakorpus: Doktoritööd Corpus of Estonian scientific texts

The corpus contains 5 million words of Estonian-language scientific literature: doctoral theses (2.3 million words) and research articles. TEI P5 XML markup, UTF-8 encoding. More info at...

Pindsüntaktiliselt analüüsitud korpus Estonian corpus with shallow syntactic...

This corpus is a monolingual corpus with Constraint Grammar-style shallow syntactic annotations.

Eesti ilukirjanduse korpus Corpus of Estonian fiction

Corpus of Estonian fiction from 1990 onward. 5.6 million words in total. More info at http://www.cl.ut.ee/korpused/segakorpus/eesti_ilukirjandus_1990 A text corpus containing Estonian...

Segakorpus: Riigikogu Corpus of the Proceedings of Estonian Parliament

Riigikogu corpus. TEI P5 XML markup, UTF-8 encoding. More info at http://www.cl.ut.ee/korpused/segakorpus/riigikogu Corpus of the Proceedings of Estonian Parliament. Contains...

Eesti emotsionaalse kõne korpus Estonian Emotional Speech Corpus

The corpus contains 1234 Estonian sentences expressing anger, joy, and sadness, as well as neutral sentences. Female voice, 44.1 kHz, 16-bit, mono; wav, textgrid:...

Eesti ajakirjanduse korpus Corpus of Estonian newspaper texts

The corpus contains Estonian newspapers, 182 million words. TEI P5 XML markup, UTF-8 encoding. More info at http://www.cl.ut.ee/korpused/ Corpus of Estonian newspaper texts, 182...

Eesti puudepanga korpus Estonian Treebank

Estonian Treebank is available in both the VISL and TigerXML formats. Esttre consists of ca 1400 manually annotated sentences (10600 tokens), the text classes represented in the...

Morfoloogiliselt ühestatud korpus Corpus of morphologically disambiguated Es...

Manually annotated, morphologically disambiguated corpus. Available for download and via Korp query.

Suulise keele korpus Corpus of Spoken Estonian

The Department of Estonian Language initiated the corpus of spoken Estonian in 1997. The corpus is compiled by the research group of Spoken Estonian (Tiit Hennoste, Airi...

EKI veamärgendatud E2 õppijakorpus (versioon 2) EKI error-annotated Estonian...

The materials of the error-annotated corpus are based on the EMMA learner language corpus and include data from the standardized tests of the Education and Youth Board (Haridus- ja Noorteamet) (grade 7, 504 texts), basic school final exams...

Eesti keele ühendkorpus 2023 (annoteerimata) Estonian National Corpus 2023 (...

Estonian corpus of written texts. Consists of the Estonian Reference Corpus (90s–2008), Contemporary and old literature, Estonian Web (2013, 2017, 2019, 2021, 2023), Timestamped...

Large-Scale Colloquial Persian 0.5

"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in asemantic taxonomy that focuses on multi-task informal Persian language understanding as a...

Szeged Corpus 2.0

written, monolingual, general, manually POS annotated reference corpus; 1,459,288 tokens; MSD tagset, XML (TEI P4) files

CoCzeFLA Chroma 2022.07

Transcripts of longitudinal audio recordings of 7 typical monolingual Czech children aged 1;7 to 3;9. Files are in plain text with UTF-8 encoding. Each file represents one...

Plaintext Wikipedia dump 2018

Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at...

Szeged Corpus 1.0

written, monolingual, general, manually POS annotated reference corpus; 1,247,546 tokens; MSD tagset, XML (TEIxLite) files

The IPI PAN Corpus

written, general, monolingual, synchronic; 250 million; XML (XCES), morphosyntactic, structural, metadata

38 datasets found