-
Free Galician morphological database for Majka
Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside... -
Automatically Annotated Corpora with Stanza and UDPipe for Czech, English, an...
This resource contains six automatically annotated corpora derived from the Leipzig Corpora Collection, covering three languages: Czech, English, and Greek. For each language,... -
Free Catalan morphological database for Majka
Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside... -
Free Polish morphological database for Majka
Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside... -
Lexicon of Lithuanian Basketball Slang Terms
The lexicon was compiled by crowdsourcing, using the dictionary-editing system LEXONOMY. It was created as a study project by the group of students in the... -
Slovenian legal natural language inference dataset SLawNLI
SLawNLI is a human-annotated dataset for Natural Language Inference (NLI) in the Slovenian legal domain. It contains 2,214 examples constructed according to the standard NLI... -
Slovene morphological segmentation and word formation dataset KOBOS
This dataset provides word-level multidimensional morphological annotations for Slovene, containing 1,935 entries manually annotated by two domain experts. The target words in... -
Projekt_ZDH_transkripce
Text written in Kurrent script, transcribed with Transkribus and then finished by hand. -
Collection of Slovenian legal texts COLESLAW 1.0
COLESLAW 1.0 is a large-scale collection of Slovenian legal texts compiled from authoritative public sources. The corpus covers legislative, judicial, and governmental legal... -
A multilingual benchmark for evaluating metalinguistic knowledge WALS-Bench 1.0
This is a large-scale multilingual benchmark for evaluating metalinguistic knowledge (i.e. explicit knowledge about the structure of languages) in large language models using... -
SYN v14: large corpus of written Czech
Corpus of contemporary written (printed) Czech, almost 5.5 GW in size (5.5 billion words, i.e. 6.6 billion tokens). It covers mostly the 1990–2024 period and features rich metadata including detailed... -
Code and data accompanying the SynSemClass paper @ LREC 2026
Snapshot of code and data accompanying the paper accepted at LREC 2026: "Automatic Suggestions Help Extending Eventive Ontology: A Case Study on SynSemClass". The timestamp of... -
SYN2025: representative corpus of written Czech
Representative corpus of contemporary written (printed) Czech, 100 MW (100 million words) in size. It was created as a representation of printed language from 2020–2024 containing a wide range of text... -
Multilingual training dataset for CAP policy topic classification ParlaCAP-train
The multilingual training dataset for CAP policy topic classification ParlaCAP-train is a collection of parliamentary speeches in 29 European languages, automatically annotated... -
Thesaurus of Modern Slovene 2.2
Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. The current version 2.2 contains 102,068 keywords and 362,464... -
Comprehensive Slovenian-Hungarian Dictionary 3.0
The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University... -
Collocations Dictionary of Modern Slovene KSSS 2.2
The database of the Collocations Dictionary of Modern Slovene 2.2 contains 4,425,942 collocations in 78,046 entries. Collocations occur in 81 different syntactic relations.... -
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-MED-Termset is an instruction-following dataset containing 975,060 prompt-response units in Slovene from the medical domain. It focuses on medical terms, with... -
Corpus-grounded evaluation dataset for grammatical question answering GramQA 1.0
The Corpus-grounded evaluation dataset for grammatical question answering (GramQA) consists of 13 grammatical questions inspired by WALS, the World Atlas of Language Structures... -
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-MED-Anatomy is an instruction-following dataset containing 711,805 prompt-response units in Slovene (with English and Latin terminology). The units form a...
