CLARIN - Repositories

Ekilex 2025. EKI dictionary and term base system

Ekilex, the dictionary and term base system of the Institute of the Estonian Language (Eesti Keele Instituut), was created for compiling and updating dictionaries and term bases, for lexicographers, terminologists and...

Monitor corpus of Slovene Trendi 2025-11

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 59 publishers. Trendi 2025-11 covers the period from January...

Little Big Translation Literature – Czech and German Translations of Yiddish ...

In order to make the process of preparing analyses for a planned monograph about Czech and German translations of Yiddish texts transparent, five source texts were transcribed...

CantusCorpus v1.0

CantusCorpus 1.0 is a large dataset of Gregorian chant intended for computational research. The dataset consists of all chants that are accessible through the Cantus Index...

Judikatura 2024

A corpus of court decisions from the three main courts in the Czech Republic (namely the Supreme Court, the Supreme Administrative Court and the Constitutional Court). The corpus is tagged...

CCLL Lemmatised Frequency Lists

The resource contains 6 frequency lists for the Corpus of Contemporary Lithuanian language (CCLL) (https://sitti.vdu.lt/en/services/): 1-LT_token_freq_list.txt - a full frequency...

Lithuanian Science and Research Terminology: Multilingual Term List

Tab-separated (TSV) UTF-8 text file containing 223 Lithuanian science and research terms with definitions and translation equivalents in English, German, and French. Intended...

Morphological lexicon Sloleks 2.0

Sloleks is the reference morphological lexicon for the Slovenian language, developed for use in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains...

Morphological lexicon Sloleks 1.2

Sloleks is the reference morphological lexicon for the Slovenian language, developed for use in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains...

Dataset of annotated collocation-distractor pairs COLLDIST

The dataset contains 59,598 collocation-distractor pairs for 2,856 headwords. A distractor is defined as an incorrect answer or alternative to a collocation, which can be similar to...

Terminological dictionary of papermaking

This digital dictionary of papermaking was made on the basis of the printed edition, i.e. Marjeta Humar (ed.) Papirniški terminološki slovar. 1996. ZRC SAZU...

Dataset of annotated headword-synonym-distractor triplets SYNDIST

The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer or alternative to a synonym, which can be similar...

Wordnet for Definition Augmentation with Encoder-Decoder Architecture

Data augmentation is a difficult task in Natural Language Processing. Simple methods that can be relatively easily applied in other domains like insertion, deletion or...

Wordnet-oriented Recognition of Derivational Relations

Derivational relations are an important element in defining meanings, as they help to explore word-formation schemes and predict senses of derivates (derived words). In this...

English WordNet 2020: Improving and Extending a WordNet for English using an Op...

The Princeton WordNet, while one of the most widely used resources for NLP, has not been updated for a long time, and as such a new project, English WordNet, has arisen to...

Towards a methodology for filtering out gaps and mismatches across wordnets: ...

This paper presents the results of large-scale noun synset mapping between plWordNet, the wordnet of Polish, and Princeton WordNet, the wordnet of English, which have shown high...

A (Non)-Perfect Match: Mapping plWordNet onto Princeton WordNet

The paper reports on the methodology and final results of a large-scale synset mapping between plWordNet and Princeton WordNet. Dedicated manual and semi-automatic mapping...

Lexical Perspective on Wordnet to Wordnet Mapping

The paper presents a feature-based model of equivalence targeted at (manual) sense linking between Princeton WordNet and plWordNet. The model incorporates insights from...

Wordnet-based Evaluation of Large Distributional Models for Polish

The paper presents the construction of large-scale test datasets for word embeddings on the basis of a very large wordnet. They were then applied to the evaluation of word embedding...

plWordNet 3.0 – Almost There

It took us nearly ten years to get from no wordnet for Polish to the largest wordnet ever built. We started small but quickly learned to dream big. Now we are about to release...

4,938 datasets found