CLARIN - Repositories

DigiLing e-Learning Hub: e-Courses for Digital Linguistics

The files represent exported e-learning resources created within the DigiLing project, www.digiling.eu. We have identified seven core subjects in Digital Linguistics and built...

Quality of Working Life 2023

A regular survey conducted as part of the long-term monitoring of the quality of working life in the Czech Republic, carried out using the research tool SQWLi...

DigiDiaDem Speech-Cognitive Dataset (DSCD-CZ-2)

An updated and expanded version of the dataset was created to investigate the speech and cognitive performance of people with varying degrees of cognitive impairment, primarily...

Quality of Working Life 2024

A regular survey conducted as part of the long-term monitoring of the quality of working life in the Czech Republic, carried out using the research tool SQWLi...

Czech PDT-C 2.0 Model for UDPipe 2 (2025-10-25)

Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 2.0 treebank (http://hdl.handle.net/11234/1-5813). The model documentation including performance can be...

Universal Dependencies 2.17 models for UDPipe 2 (2025-11-25)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 169 treebanks of 93 languages of Universal Dependencies 2.17 Treebanks, created solely using UD 2.17 data...

Russian Media Corpus on the Harris–Trump Debate (RMC_HTD)

Russian Media Corpus on the Harris–Trump Debate contains metadata from Russian-language news articles reporting on the presidential debate between Kamala Harris and Donald...

Quality of Working Life 2022

A regular survey conducted as part of the long-term monitoring of the quality of working life in the Czech Republic, carried out using the research tool SQWLi...

Corpus of conversational humor Krohot 1.0

The KROHOT corpus consists of 10 audio recordings of private, spontaneous conversations between two or three speakers, with a total duration of 232 minutes. Most recordings were...

CMC training corpus Janes-Tag 2.1

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

CMC training corpus Janes-Norm 1.2

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

Frequency List of Lithuanian Homoforms

The list contains 63,139 homoforms. In the Frequency List of Lithuanian Homoforms, the following data are provided for each homoform: 1) the homoform itself, 2) its lemma (or...

Slovene-Japanese Learner's Dictionary sloJa 1.1

The Slovenian-Japanese online dictionary for Slovenian-speaking learners of Japanese was compiled by extracting and converting the Japanese-Slovenian dictionary jaSlo 3.1...

Ontology of topics for Slovenian as a second and foreign language ONTEM 1.0

ONTEM 1.0 comprises 1,019 manually prepared entries, each consisting of information about the lemma, part-of-speech (following the MULTEXT-East tagset for Slovenian,...

Slovene learner corpus KOST 2.1

The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 10,590 texts (almost 1.4 million words) written by adult speakers for whom...

Comparable corpus of parliamentary debates ParlaMint-ES-CN 1.0

The ParlaMint-ES-CN corpus is the contribution of the Parliament of the Canary Islands (Parlamento de Canarias) to the ParlaMint collection of comparable parliamentary corpora...

Monitor corpus of Slovene Trendi 2025-09

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-09 covers the period from January...

Slovene-Japanese Learner's Dictionary sloJa 1.0

The Slovenian-Japanese online dictionary for Slovenian-speaking learners of Japanese was compiled by extracting and converting the Japanese-Slovenian dictionary jaSlo 3.1...

Slovene learner corpus KOST 2.0

The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 8,347 texts (almost 1.3 million words) written by adult speakers for whom...

Corpus of scientific texts of contemporary Slovenian KZB 1.0

The Corpus of scientific texts of contemporary Slovenian consists of 25 million words from scientific monographs and scientific papers written mainly between 2000 and 2023. It...

4,938 datasets found