CLARIN - Repositories

SALSA - The SAarbrücken Lexical Semantics Annotation and Analysis Project

The SALSA corpus is based on the TIGER corpus. The TIGER corpus (Version 2.1) consists of approx. 900,000 tokens (50,000 sentences) of German newspaper text, taken from the...

epic-uds

This dataset has no description

Multilingual training dataset for CAP policy topic classification ParlaCAP-train

The multilingual training dataset for CAP policy topic classification ParlaCAP-train is a collection of parliamentary speeches in 29 European languages, automatically annotated...

Thesaurus of Modern Slovene 2.2

Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. The current version 2.2 contains 102,068 keywords and 362,464...

Comprehensive Slovenian-Hungarian Dictionary 3.0

The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University...

Collocations Dictionary of Modern Slovene KSSS 2.2

The database of the Collocations Dictionary of Modern Slovene 2.2 contains 4,425,942 collocations in 78,046 entries. Collocations occur in 81 different syntactic relations....

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED-Termset is an instruction-following dataset containing 975,060 prompt-response units in Slovene from the medical domain. It focuses on medical terms, with...

Corpus-grounded evaluation dataset for grammatical question answering GramQA 1.0

The Corpus-grounded evaluation dataset for grammatical question answering (GramQA) consists of 13 grammatical questions inspired by WALS, the World Atlas of Language Structures...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED-Anatomy is an instruction-following dataset containing 711,805 prompt-response units in Slovene (with English and Latin terminology). The units form a...

Comprehensive Slovenian-Hungarian Dictionary 2.0

The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University...

Collocations Dictionary of Modern Slovene KSSS 2.0

The database of the Collocations Dictionary of Modern Slovene 2.0 contains 4,491,958 collocations in 81,443 entries. Collocations occur in 81 different syntactic relations....

Thesaurus of Modern Slovene 2.0

Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. It is sourced from the data in two principal language resources:...

Digital library and corpus of historical Slovene IMP 1.1

The IMP digital library contains historical Slovene books and other publications, a total of 658 texts with over 45,000 pages from the period 1584-1919. Each text contains...

Coreference in Universal Dependencies 1.4 (CorefUD 1.4)

CorefUD is a collection of previously existing coreference-annotated datasets that have been converted to a unified annotation scheme. In its current version (1.4), CorefUD...

List of potentially non-standard vocabulary candidates MEZZANINE-NstdLex 1.0

MEZZANINE-NstdLex is a dataset containing 4,237 potentially non-standard vocabulary candidates from the Sloleks Morphological Lexicon of Slovene (collected from among the...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-PHARMA is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain, particularly in the...

Morphological lexicon Franček

The Morphological Lexicon Franček for the Slovenian language contains non-stressed inflected word forms for 96,402 entries (out of 100,006 total) of the Franček Portal Headword List....

Morphological Lexicon of Slovene Sloleks 3.1

Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their...

Monitor corpus of Slovene Trendi 2025-12

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 59 publishers. Trendi 2025-12 covers the period from January...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

4,930 datasets found