-
Trans-related Online Corpus (TROC)
The resource includes three corpora. 0. Raw corpus of YouTube and Reddit comments (see readme.txt). 1. Stance annotated corpus of contra-trans and pro-trans comments... -
Lexicon of Lithuanian Basketball Slang Terms
The lexicon was compiled by crowdsourcing, using the dictionary-editing system LEXONOMY. It was compiled as a study project by a group of students in the... -
Eesti keele spontaanse kõne foneetiline korpus v.1.0.0 Phonetic Corpus of Es...
The aim of the corpus is to compile a large number of quality recordings of spontaneous Estonian and segment them phonetically on different levels. The project started in autumn... -
Mulgi sõnaraamat
The large online Mulgi dictionary presents Mulgi dialect words from all five parishes of Mulgimaa: Halliste, Helme, Karksi, Paistu, and Tarvastu. The dictionary contains over 20,000... -
Slovenian legal natural language inference dataset SLawNLI
SLawNLI is a human-annotated dataset for Natural Language Inference (NLI) in the Slovenian legal domain. It contains 2,214 examples constructed according to the standard NLI... -
Slovenian translation corpus Spook 1.1
The Spook corpus was compiled to enable corpus-based studies in translation and comprises 713 texts and about 375 thousand words. It is composed of three types of texts. The... -
Slovene morphological segmentation and word formation dataset KOBOS
This dataset provides word-level multidimensional morphological annotations for Slovene, containing 1,935 entries manually annotated by two domain experts. The target words in... -
Verbs annotated for morphemic structure in Czech, English, German, Spanish v2
A sample of verb lemmas in four languages: Czech (19,040 lemmas), English (9,969 lemmas), German (27,158 lemmas), Spanish (11,768 lemmas). Each verb lemma is annotated for its... -
Lexical Dataset of Czech nene- Constructions (2026-02-28)
A database of attested words with the double negative prefix nene- in Czech, serving primarily the purposes of a bachelor's thesis at ÚČJTK FFUK (Dvojitá negace nene- (typ nenedostal, nenepatrný) v češtině;... -
HeCz: Large Scale Self-Paced Reading Corpus Newspaper Headlines in Czech
The HeCz corpus comprises self-paced reading data for 1919 newspaper headlines (23,634 words) in Czech, with each headline being accompanied by a yes–no comprehension question,... -
Free Welsh morphological database for Majka
Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside... -
Free Russian morphological database for Majka
Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside... -
Free Asturian morphological database for Majka
Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside... -
Projekt_ZDH_transkripce
Text written in Kurrent script, transcribed with Transkribus and then finished by hand. -
English (L2) Learner Corpus
The NEC corpus samples used in the study comprise 433 examination responses (essays) written in L2 English on two topics, namely, The importance of volunteering for young... -
English (L2) Learner Corpus
This data was used to analyze learner writing reported in Juknevičienė and Šeškauskienė (2014). It consists of two resources: a corpus of learner writing and a summative results... -
Monitor corpus of Slovene Trendi 2026-02
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 60 publishers. Trendi 2026-02 covers the period from January... -
Collection of Slovenian legal texts COLESLAW 1.0
COLESLAW 1.0 is a large-scale collection of Slovenian legal texts compiled from authoritative public sources. The corpus covers legislative, judicial, and governmental legal... -
A multilingual benchmark for evaluating metalinguistic knowledge WALS-Bench 1.0
This is a large-scale multilingual benchmark for evaluating metalinguistic knowledge (i.e. explicit knowledge about the structure of languages) in large language models using... -
Monitor corpus of Slovene Trendi 2026-01
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 60 publishers. Trendi 2026-01 covers the period from January...
