- The Kola Peninsula Spoken Corpus (KoPeSC) 1: Spoken Corpus to “Речь поморов Т...
The Kola Peninsula Spoken Corpus (KoPeSC) is a dataset of sound recordings and their transcriptions in ELAN of Pomor Russian dialect speech and of Sámi and Russian speech as...
- Corpus of spoken Slovenian ROG-Dialog 1.0
Corpus of spoken Slovenian ROG-Dialog consists of volunteered audio, recorded by students by asking their relatives or acquaintances to talk on record in their homes. The...
- Corpus of conversational humor Krohot 1.0
The KROHOT corpus consists of 10 audio recordings of private, spontaneous conversations between two or three speakers, with a total duration of 232 minutes. Most recordings were...
- The "Mobile languages" corpus MoJezik 1.0 (audio)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno...
- The "Mobile languages" corpus MoJezik 1.0 (transcription)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno...
- Business English learner speech corpus SAPS
SAPS is a specialized speech corpus which contains business meeting simulations in English between undergraduate students of Languages for Business and Economics at the School...
- Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0)
The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes...
- Spoken corpus of Karel Makoň
Talks given by Karel Makoň to his friends from the late 1960s through the early 1990s. The topic is mostly Christian mysticism.
- Prague DaTabase of Spoken Czech 1.0
PDTSC 1.0 is a multi-purpose corpus of spoken language. 768,888 tokens, 73,374 sentences and 7,324 minutes of spontaneous dialog speech have been recorded, transcribed and...
- Vystadial 2013 – English data
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....
- Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)
A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...
- Vystadial 2016 – Czech data
This is the Czech data collected during the VYSTADIAL project. It is an extension of the 'Vystadial 2013' Czech part data release. The dataset comprises telephone...
- Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)
A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C in short in the sequel) is a consolidated...
- Vystadial 2013 – Czech data
Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....
- Spoken corpus Gos 2.1 (transcriptions)
The spoken corpus Gos 2.1 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand...
- Spoken corpus Gos VideoLectures 2.0 (audio)
Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...
- Frequency lists of word parts from the GOS 1.0 corpus 1.1
Frequency lists of words split into word parts were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool...
- Spoken Torlak dialect corpus 1.0 (transcription)
Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local...
- Frequency lists of word-level n-grams from the GOS 1.0 corpus 1.1
Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction...
- TED-ELH Parallel Corpus (ELEXIS)
The corpus contains parallel-aligned scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data. See also: http://hdl.handle.net/20.500.11821/34
