Dataset - B2FIND

Tourism Corpus TURK 3.0

The Tourism Corpus TURK 3.0 is a multilingual corpus of tourism-related texts in Slovenian, accompanied by some texts (about 6% of the corpus) in English, Italian and German....

Lithuanian Hate Speech Corpus v.1

This corpus consists of (1) examples of hate speech based on ethnicity, nationality, or race, and (2) a collection of neutral comments, including both general comments and...

MultiCo-Hub: a corpus of multimodal enrichments with motion-trajectory annota...

MultiCo-Hub is a multimodal dataset including 11 zipped subsets (henceforth: sessions) of time-aligned audio, video and motion-capture–derived BVH data, together with...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of units of...

Parallel sense-annotated corpus ELEXIS-WSD 1.3

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.3 contains sentences for 10...

Corpus of Transcriptions - part 2

The second part of the Corpus of Transcriptions contains phonemic transcriptions of a short passage from Lecumberri and Maidment (2000, p. 78) performed by the undergraduate...

DiPSS - longitudinal corpus of drift in Polish students of Spanish

The DiPSS corpus (part 1) is a longitudinal speech resource documenting the phonetic productions of L1 Polish students learning L2 English and L3 Spanish. It includes recordings...

The Kola Peninsula Spoken Corpus (KoPeSC) 1: Spoken Corpus to “Речь поморов Т...

The Kola Peninsula Spoken Corpus (KoPeSC) is a dataset of sound recordings and their transcriptions in ELAN of Pomor Russian dialect speech and of Sámi and Russian speech as...

Corpus of spoken Slovenian ROG-Dialog 1.0

The corpus of spoken Slovenian ROG-Dialog consists of volunteered audio recorded by students, who asked their relatives or acquaintances to talk on record in their homes. The...

Monitor corpus of Slovene Trendi 2025-11

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 59 publishers. Trendi 2025-11 covers the period from January...

Monitor corpus of Slovene Trendi 2025-10

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-10 covers the period from January...

DigiDiaDem Speech-Cognitive Dataset (DSCD-CZ-2)

An updated and expanded version of the dataset was created to investigate the speech and cognitive performance of people with varying degrees of cognitive impairment, primarily...

Russian Media Corpus on the Harris–Trump Debate (RMC_HTD)

Russian Media Corpus on the Harris–Trump Debate contains metadata from Russian-language news articles reporting on the presidential debate between Kamala Harris and Donald...

Corpus of conversational humor Krohot 1.0

The KROHOT corpus consists of 10 audio recordings of private, spontaneous conversations between two or three speakers, with a total duration of 232 minutes. Most recordings were...

CMC training corpus Janes-Tag 2.1

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

CMC training corpus Janes-Norm 1.2

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

Slovene learner corpus KOST 2.1

The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 10,590 texts (almost 1.4 million words) written by adult speakers for whom...

Comparable corpus of parliamentary debates ParlaMint-ES-CN 1.0

The ParlaMint-ES-CN corpus is the contribution of the Parliament of the Canary Islands (Parlamento de Canarias) to the ParlaMint collection of comparable parliamentary corpora...

Monitor corpus of Slovene Trendi 2025-09

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-09 covers the period from January...

Slovene learner corpus KOST 2.0

The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 8,347 texts (almost 1.3 million words) written by adult speakers for whom...

2,943 datasets found