-
Corpus of Transcriptions - part 2
The second part of the Corpus of Transcriptions contains phonemic transcriptions of a short passage from Lecumberri and Maidment (2000, p. 78) performed by the undergraduate... -
plWordNet 5.0 – challenges of a life-long wordnet development process
The construction of plWordNet began in 2005 and has continued ever since. In this paper we present the latest 5.0 version and describe the challenges connected with a... -
DiPSS - longitudinal corpus of drift in Polish students of Spanish
The DiPSS corpus (part 1) is a longitudinal speech resource documenting the phonetic productions of L1 Polish students learning L2 English and L3 Spanish. It includes recordings... -
HANOI corpus and tool for analysis of note-taking of conference interpreters
HANOI is a resource for understanding the process of consecutive interpreting through the analysis of the note-taking process. Each data package is a record of an interpretation... -
MultiCo-Hub: a corpus of multimodal enrichments with motion-trajectory annota...
MultiCo-Hub is a multimodal dataset including 11 zipped subsets (henceforth: sessions) of time-aligned audio, video and motion-capture–derived BVH data, together with... -
The Kola Peninsula Spoken Corpus (KoPeSC) 1: Spoken Corpus to “Речь поморов Т...
The Kola Peninsula Spoken Corpus (KoPeSC) is a dataset of sound recordings and their transcriptions in ELAN of Pomor Russian dialect speech and of Sámi and Russian speech as... -
Corpus of spoken Slovenian ROG-Dialog 1.0
The ROG-Dialog corpus of spoken Slovenian consists of volunteered audio recorded by students, who asked relatives or acquaintances to talk on record in their homes. The... -
Monitor corpus of Slovene Trendi 2025-11
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 59 publishers. Trendi 2025-11 covers the period from January... -
Monitor corpus of Slovene Trendi 2025-10
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-10 covers the period from January... -
DigiLing e-Learning Hub: e-Courses for Digital Linguistics
The files represent exported e-learning resources created within the DigiLing project, www.digiling.eu. We have identified seven core subjects in Digital Linguistics and built... -
Quality of Working Life 2023
A regular survey conducted as part of the long-term monitoring of the quality of working life in the Czech Republic, carried out using the research tool SQWLi... -
DigiDiaDem Speech-Cognitive Dataset (DSCD-CZ-2)
An updated and expanded version of the dataset was created to investigate the speech and cognitive performance of people with varying degrees of cognitive impairment, primarily... -
Quality of Working Life 2024
A regular survey conducted as part of the long-term monitoring of the quality of working life in the Czech Republic, carried out using the research tool SQWLi... -
Czech PDT-C 2.0 Model for UDPipe 2 (2025-10-25)
Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 2.0 treebank (http://hdl.handle.net/11234/1-5813). The model documentation including performance can be... -
Universal Dependencies 2.17 models for UDPipe 2 (2025-11-25)
Tokenizer, POS Tagger, Lemmatizer and Parser models for 169 treebanks of 93 languages of Universal Dependencies 2.17 Treebanks, created solely using UD 2.17 data... -
Russian Media Corpus on the Harris–Trump Debate (RMC_HTD)
Russian Media Corpus on the Harris–Trump Debate contains metadata from Russian-language news articles reporting on the presidential debate between Kamala Harris and Donald... -
Quality of Working Life 2022
A regular survey conducted as part of the long-term monitoring of the quality of working life in the Czech Republic, carried out using the research tool SQWLi... -
Corpus of conversational humor Krohot 1.0
The KROHOT corpus consists of 10 audio recordings of private, spontaneous conversations between two or three speakers, with a total duration of 232 minutes. Most recordings were... -
CMC training corpus Janes-Tag 2.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
CMC training corpus Janes-Norm 1.2
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
