CLARIN - Repositories

Free Galician morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

Automatically Annotated Corpora with Stanza and UDPipe for Czech, English, an...

This resource contains six automatically annotated corpora derived from the Leipzig Corpora Collection, covering three languages: Czech, English, and Greek. For each language,...

Free Catalan morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

Free Polish morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

Lexicon of Lithuanian Basketball Slang Terms

The lexicon was compiled by crowdsourcing, using the dictionary-editing system LEXONOMY. It was compiled as a study project by the group of students in the...

Slovenian legal natural language inference dataset SLawNLI

SLawNLI is a human-annotated dataset for Natural Language Inference (NLI) in the Slovenian legal domain. It contains 2,214 examples constructed according to the standard NLI...

Slovene morphological segmentation and word formation dataset KOBOS

This dataset provides word-level multidimensional morphological annotations for Slovene, containing 1,935 entries manually annotated by two domain experts. The target words in...

Projekt_ZDH_transkripce

Text written in Kurrent script, transcribed with Transkribus and then finished by hand.

Collection of Slovenian legal texts COLESLAW 1.0

COLESLAW 1.0 is a large-scale collection of Slovenian legal texts compiled from authoritative public sources. The corpus covers legislative, judicial, and governmental legal...

A multilingual benchmark for evaluating metalinguistic knowledge WALS-Bench 1.0

This is a large-scale multilingual benchmark for evaluating metalinguistic knowledge (i.e. explicit knowledge about the structure of languages) in large language models using...

SYN v14: large corpus of written Czech

Corpus of contemporary written (printed) Czech with a size of almost 5.5 GW (i.e. 6.6 billion tokens). It covers mostly the 1990–2024 period and features rich metadata including detailed...

Code and data accompanying the SynSemClass paper @ LREC 2026

Snapshot of code and data accompanying the paper accepted at LREC 2026: "Automatic Suggestions Help Extending Eventive Ontology: A Case Study on SynSemClass". The timestamp of...

SYN2025: representative corpus of written Czech

Representative corpus of contemporary written (printed) Czech with a size of 100 MW. It was created as a representation of printed language from 2020–2024 containing a wide range of text...

Multilingual training dataset for CAP policy topic classification ParlaCAP-train

The multilingual training dataset for CAP policy topic classification ParlaCAP-train is a collection of parliamentary speeches in 29 European languages, automatically annotated...

Thesaurus of Modern Slovene 2.2

Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. The current version 2.2 contains 102,068 keywords and 362,464...

Comprehensive Slovenian-Hungarian Dictionary 3.0

The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University...

Collocations Dictionary of Modern Slovene KSSS 2.2

The database of the Collocations Dictionary of Modern Slovene 2.2 contains 4,425,942 collocations in 78,046 entries. Collocations occur in 81 different syntactic relations....

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED-Termset is an instruction-following dataset containing 975,060 prompt-response units in Slovene from the medical domain. It focuses on medical terms, with...

Corpus-grounded evaluation dataset for grammatical question answering GramQA 1.0

The Corpus-grounded evaluation dataset for grammatical question answering (GramQA) consists of 13 grammatical questions inspired by WALS, the World Atlas of Language Structures...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED-Anatomy is an instruction-following dataset containing 711,805 prompt-response units in Slovene (with English and Latin terminology). The units form a...

1,492 datasets found