-
Tourism Corpus TURK 3.0
The Tourism Corpus TURK 3.0 is a multilingual corpus of tourism-related texts in Slovenian, accompanied by some texts (about 6% of the corpus) in English, Italian and German.... -
Lithuanian Hate Speech Corpus v.1
This corpus consists of (1) examples of hate speech based on ethnicity, nationality, or race, and (2) a collection of neutral comments, including both general comments and... -
MultiCo-Hub: a corpus of multimodal enrichments with motion-trajectory annota...
MultiCo-Hub is a multimodal dataset including 11 zipped subsets (henceforth: sessions) of time-aligned audio, video and motion-capture–derived BVH data, together with... -
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of units of... -
Parallel sense-annotated corpus ELEXIS-WSD 1.3
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.3 contains sentences for 10... -
Corpus of Transcriptions - part 2
The second part of the Corpus of Transcriptions contains phonemic transcriptions of a short passage from Lecumberri and Maidment (2000, p. 78) performed by the undergraduate... -
DiPSS - longitudinal corpus of drift in Polish students of Spanish
The DiPSS corpus (part 1) is a longitudinal speech resource documenting the phonetic productions of L1 Polish students learning L2 English and L3 Spanish. It includes recordings... -
The Kola Peninsula Spoken Corpus (KoPeSC) 1: Spoken Corpus to “Речь поморов Т...
The Kola Peninsula Spoken Corpus (KoPeSC) is a dataset of sound recordings and their transcriptions in ELAN of Pomor Russian dialect speech and of Sámi and Russian speech as... -
Corpus of spoken Slovenian ROG-Dialog 1.0
The ROG-Dialog corpus of spoken Slovenian consists of volunteered audio recorded by students, who asked their relatives or acquaintances to talk on record in their homes. The... -
Monitor corpus of Slovene Trendi 2025-11
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 59 publishers. Trendi 2025-11 covers the period from January... -
Monitor corpus of Slovene Trendi 2025-10
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-10 covers the period from January... -
DigiDiaDem Speech-Cognitive Dataset (DSCD-CZ-2)
An updated and expanded version of the dataset was created to investigate the speech and cognitive performance of people with varying degrees of cognitive impairment, primarily... -
Russian Media Corpus on the Harris–Trump Debate (RMC_HTD)
Russian Media Corpus on the Harris–Trump Debate contains metadata from Russian-language news articles reporting on the presidential debate between Kamala Harris and Donald... -
Corpus of conversational humor Krohot 1.0
The KROHOT corpus consists of 10 audio recordings of private, spontaneous conversations between two or three speakers, with a total duration of 232 minutes. Most recordings were... -
CMC training corpus Janes-Tag 2.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
CMC training corpus Janes-Norm 1.2
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,... -
Slovene learner corpus KOST 2.1
The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 10,590 texts (almost 1.4 million words) written by adult speakers for whom... -
Comparable corpus of parliamentary debates ParlaMint-ES-CN 1.0
The ParlaMint-ES-CN corpus is the contribution of the Parliament of the Canary Islands (Parlamento de Canarias) to the ParlaMint collection of comparable parliamentary corpora... -
Monitor corpus of Slovene Trendi 2025-09
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-09 covers the period from January... -
Slovene learner corpus KOST 2.0
The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 8,347 texts (almost 1.3 million words) written by adult speakers for whom...
