CLARIN - Repositories

Comprehensive Slovenian-Hungarian Dictionary 2.0

The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University...

Collocations Dictionary of Modern Slovene KSSS 2.0

The database of the Collocations Dictionary of Modern Slovene 2.0 contains 4,491,958 collocations in 81,443 entries. Collocations occur in 81 different syntactic relations....

Thesaurus of Modern Slovene 2.0

Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. It is sourced from the data in two principal language resources:...

Coreference in Universal Dependencies 1.4 (CorefUD 1.4)

CorefUD is a collection of previously existing coreference-annotated datasets that have been converted to a unified annotation scheme. In its current version (1.4), CorefUD...

List of potentially non-standard vocabulary candidates MEZZANINE-NstdLex 1.0

MEZZANINE-NstdLex is a dataset containing 4,237 potentially non-standard vocabulary candidates from the Sloleks Morphological Lexicon of Slovene (collected from among the...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-PHARMA is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain, particularly in the...

Morphological lexicon Franček

The Morphological Lexicon Franček for the Slovenian language contains non-stressed inflected word forms for 96,402 entries (out of 100,006 total) of the Franček Portal Headword List....

Morphological Lexicon of Slovene Sloleks 3.1

Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their...

Morphological lexicon Sloleks 3.0

Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their...

Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus

The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialized scripts for...

Frequency lists of collocations from the Gigafida 2.1 corpus

Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialised...

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

This dataset contains data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple,...

CRAC 2026 Empty Nodes Baseline Model

The crac2026_empty_nodes_baseline is an XLM-RoBERTa-large-based multilingual model for the CRAC 2026 Empty Nodes Baseline system https://github.com/ufal/crac2026_empty_nodes_baseline...

Errant Extended Vocabulary

The ontology provides a FAIR, interoperable vocabulary for grammatical error annotation and correction, integrating the English-focused ERRANT taxonomy with Czech-specific...

Dataset of annotated collocation-distractor pairs COLLDIST

The dataset contains 59,598 collocation-distractor pairs for 2,856 headwords. A distractor is defined as an incorrect answer/alternative to a collocation, which can be similar to...

Dataset of annotated headword-synonym-distractor triplets SYNDIST

The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer/alternative to a synonym, which can be similar...

Wordnet for Definition Augmentation with Encoder-Decoder Architecture

Data augmentation is a difficult task in Natural Language Processing. Simple methods that can be relatively easily applied in other domains like insertion, deletion or...

Wordnet-oriented Recognition of Derivational Relations

Derivational relations are an important element in defining meanings, as they help to explore word-formation schemes and predict senses of derivates (derived words). In this...

English WordNet 2020: Improving and Extending a WordNet for English using an Op...

The Princeton WordNet, while one of the most widely used resources for NLP, has not been updated for a long time, and as such a new project English WordNet has arisen to...

Towards a methodology for filtering out gaps and mismatches across wordnets: ...

This paper presents the results of large-scale noun synset mapping between plWordNet, the wordnet of Polish, and Princeton WordNet, the wordnet of English, which have shown high...

1,492 datasets found