CLARIN - Repositories

Collocations Dictionary of Modern Slovene KSSS 2.0

The database of the Collocations Dictionary of Modern Slovene 2.0 contains 4,491,958 collocations in 81,443 entries. Collocations occur in 81 different syntactic relations....

Thesaurus of Modern Slovene 2.0

Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. It is sourced from the data in two principal language resources:...

Digital library and corpus of historical Slovene IMP 1.1

The IMP digital library contains historical Slovene books and other publications, 658 texts in total with over 45,000 pages from the period 1584-1919. Each text contains...

Coreference in Universal Dependencies 1.4 (CorefUD 1.4)

CorefUD is a collection of previously existing coreference-annotated datasets that have been converted to a unified annotation scheme. In its current version (1.4), CorefUD...

List of potentially non-standard vocabulary candidates MEZZANINE-NstdLex 1.0

MEZZANINE-NstdLex is a dataset containing 4,237 potentially non-standard vocabulary candidates from the Sloleks Morphological Lexicon of Slovene (collected from among the...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-PHARMA is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain, particularly in the...

Morphological lexicon Franček

Morphological Lexicon Franček for the Slovenian language contains non-stressed inflected word forms for 96,402 entries (out of 100,006 total) of the Franček Portal Headword List....

Morphological Lexicon of Slovene Sloleks 3.1

Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their...

Monitor corpus of Slovene Trendi 2025-12

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 59 publishers. Trendi 2025-12 covers the period from January...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Multilingual comparable corpora of parliamentary debates ParlaMint 5.0

ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Morphological lexicon Sloleks 3.0

Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their...

Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus

The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialized scripts for...

Valency lexicon extracted from the Gigafida 2.1 corpus

The valency lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialized scripts for...

Frequency lists of collocations from the Gigafida 2.1 corpus

Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialised...

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

This dataset contains data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple,...

Lithuanian-English Parallel Cybersecurity Corpus – DVITAS

Lithuanian-English Parallel Cybersecurity Corpus consists of official cybersecurity documents of the Republic of Lithuania and their English translations, dating from 2014 to...

CRAC 2026 Empty Nodes Baseline Model

The crac2026_empty_nodes_baseline is an XLM-RoBERTa-large-based multilingual model for the CRAC 2026 Empty Nodes Baseline system https://github.com/ufal/crac2026_empty_nodes_baseline...

Errant Extended Vocabulary

The ontology provides a FAIR, interoperable vocabulary for grammatical error annotation and correction, integrating the English-focused ERRANT taxonomy with Czech-specific...

Language portal Sõnaveeb 2025

Sõnaveeb is the new dictionary portal of the Institute of the Estonian Language, bringing together language information from many of the institute's lexical collections and databases. More info at https://sonaveeb.ee/...

4,938 datasets found