-
Comprehensive Slovenian-Hungarian Dictionary 2.0
The Comprehensive Slovenian-Hungarian Dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University... -
Collocations Dictionary of Modern Slovene KSSS 2.0
The database of the Collocations Dictionary of Modern Slovene 2.0 contains 4,491,958 collocations in 81,443 entries. Collocations occur in 81 different syntactic relations.... -
Thesaurus of Modern Slovene 2.0
Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. It is sourced from the data in two principal language resources:... -
Coreference in Universal Dependencies 1.4 (CorefUD 1.4)
CorefUD is a collection of previously existing coreference-annotated datasets that have been converted to a unified annotation scheme. In its current version (1.4), CorefUD... -
List of potentially non-standard vocabulary candidates MEZZANINE-NstdLex 1.0
MEZZANINE-NstdLex is a dataset containing 4,237 potentially non-standard vocabulary candidates from the Sloleks Morphological Lexicon of Slovene (collected from among the... -
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-PHARMA is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain, particularly in the... -
Morphological lexicon Franček
Morphological Lexicon Franček for the Slovenian language contains non-stressed inflected word forms for 96,402 entries (out of 100,006 total) of the Franček Portal Headword List.... -
Morphological Lexicon of Slovene Sloleks 3.1
Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their... -
Morphological lexicon Sloleks 3.0
Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their... -
Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus
The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialized scripts for... -
Frequency lists of collocations from the Gigafida 2.1 corpus
Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialised... -
Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
This dataset contains data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple,... -
CRAC 2026 Empty Nodes Baseline Model
The crac2026_empty_nodes_baseline is an XLM-RoBERTa-large–based multilingual model for the CRAC 2026 Empty Nodes Baseline system https://github.com/ufal/crac2026_empty_nodes_baseline... -
Errant Extended Vocabulary
The ontology provides a FAIR, interoperable vocabulary for grammatical error annotation and correction, integrating the English-focused ERRANT taxonomy with Czech-specific... -
Dataset of annotated collocation-distractor pairs COLLDIST
The dataset contains 59,598 collocation-distractor pairs for 2,856 headwords. A distractor is defined as an incorrect answer/alternative to a collocation, which can be similar to... -
Dataset of annotated headword-synonym-distractor triplets SYNDIST
The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer/alternative to a synonym, which can be similar... -
Wordnet for Definition Augmentation with Encoder-Decoder Architecture
Data augmentation is a difficult task in Natural Language Processing. Simple methods that can be relatively easily applied in other domains like insertion, deletion or... -
Wordnet-oriented Recognition of Derivational Relations
Derivational relations are an important element in defining meanings, as they help to explore word-formation schemes and predict senses of derivates (derived words). In this... -
English WordNet 2020: Improving and Extending a WordNet for English using an Op...
The Princeton WordNet, while one of the most widely used resources for NLP, has not been updated for a long time, and as such a new project English WordNet has arisen to... -
Towards a methodology for filtering out gaps and mismatches across wordnets: ...
This paper presents the results of large-scale noun synset mapping between plWordNet, the wordnet of Polish, and Princeton WordNet, the wordnet of English, which have shown high...
