Dataset - B2FIND

Interview with Arjan van Hessen, online, 2 April 2025

Interview for an English-language chapter in the book Artworks through Oral History, arising from the PDI-SSH-supported research project Oral History–Stories at...

Modelling word learning and recognition using visually grounded speech

A set of recorded isolated nouns, verbs and image annotations used for testing the word recognition performance of our speech2image model. We trained a word recognition model...

ASR model evaluator

Docker image with ASR evaluation tool that has support for WER calculation on punctuated and capitalised transcripts. The UI allows uploading the reference and predicted...

Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0)

The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes...

A Speech Test Set of Practice Business Presentations with Additional Relevant...

We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web-pages. The corpus is intended for...

UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 2

The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg....

UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 3

The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg....

Prague DaTabase of Spoken Czech 1.0

PDTSC 1.0 is a multi-purpose corpus of spoken language. 768,888 tokens, 73,374 sentences and 7,324 minutes of spontaneous dialog speech have been recorded, transcribed and...

UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 1

The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg....

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...

Speech Processing, Recognition and Automatic Annotation Kit (SPRAAK)

SPRAAK (also Dutch for 'speech') is a speech recognition package. As such it is useful for transcription of speech, alignment of spoken and written language, annotation of...

STAZKA – Speech recordings from vehicles

The database contains two sets of recordings, both made in moving or stationary vehicles (passenger cars or trucks). All data were recorded within the project...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C for short) is a consolidated...

The "Mići Princ" text and speech dataset of Chakavian micro-dialects

The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the...

Spoken corpus Gos VideoLectures 2.0 (audio)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Spoken corpus Gos VideoLectures 3.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Spoken corpus Gos VideoLectures 1.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference speech corpus of Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures...

Speech Database of Spoken Flight Information Enquiries SOFES 1.0

The SOFES speech database (Spoken Flight Enquiries in Slovene) is a collection of transcribed and segmented audio recordings of spoken flight-information enquiries in Slovene....

Spoken corpus Gos VideoLectures 3.0 (audio)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Spoken corpus Gos VideoLectures 1.0 (audio)

Gos VideoLectures is an add-on to the Gos reference speech corpus of Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures...

40 datasets found