Dataset - B2FIND

Phonetic segmentation and acoustic measurements of spoken Slovenian SloPhonSe...

SloPhonSeg 1.0 is a dataset of automatically generated phonetic segmentations and acoustic-phonetic measurements for selected recordings and transcriptions from the spoken...

Training corpus of spoken Slovenian ROG 1.1

Training corpus of spoken Slovenian ROG 1.1 is an improved version of the ROG 1.0 corpus (http://hdl.handle.net/11356/1992). The main differences between the original and the...

Training corpus of spoken Slovenian ROG 1.0

Training corpus of spoken Slovenian ROG 1.0 is the main Slovenian-language resource for training and evaluating technologies aimed at processing speech or speech transcripts, such...

The Kola Peninsula Spoken Corpus (KoPeSC) 1: Spoken Corpus to “Речь поморов Т...

The Kola Peninsula Spoken Corpus (KoPeSC) is a dataset of sound recordings and their transcriptions in ELAN of Pomor Russian dialect speech and of Sámi and Russian speech as...

Corpus of spoken Slovenian ROG-Dialog 1.0

Corpus of spoken Slovenian ROG-Dialog consists of volunteered audio recorded by students, who asked relatives or acquaintances to talk on record in their homes. The...

Corpus of conversational humor Krohot 1.0

The KROHOT corpus consists of 10 audio recordings of private, spontaneous conversations between two or three speakers, with a total duration of 232 minutes. Most recordings were...

The "Mobile languages" corpus MoJezik 1.0 (audio)

The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno...

The "Mobile languages" corpus MoJezik 1.0 (transcription)

The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno...

Business English learner speech corpus SAPS

SAPS is a specialized speech corpus which contains business meeting simulations in English between undergraduate students of Languages for Business and Economics at the School...

Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0)

The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes...

Spoken corpus of Karel Makoň

Talks given by Karel Makoň to his friends from the late 1960s through the early 1990s. The topic is mostly Christian mysticism.

Prague DaTabase of Spoken Czech 1.0

PDTSC 1.0 is a multi-purpose corpus of spoken language. 768,888 tokens, 73,374 sentences and 7,324 minutes of spontaneous dialog speech have been recorded, transcribed and...

Vystadial 2013 – English data

Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...

Vystadial 2016 – Czech data

This is the Czech data collected during the VYSTADIAL project. It is an extension of the Czech part of the 'Vystadial 2013' data release. The dataset comprises telephone...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C for short) is a consolidated...

Vystadial 2013 – Czech data

Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....

Spoken corpus Gos 2.1 (transcriptions)

The spoken corpus Gos 2.1 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand...

Spoken corpus Gos VideoLectures 2.0 (audio)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Frequency lists of word parts from the GOS 1.0 corpus 1.1

Frequency lists of words split into word parts were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool...

56 datasets found