Spoken corpora of parliamentary debates ParlaSpeech 3.0

Dataset

The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of the Croatian, Serbian, Polish, and Czech parliaments, available in the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859), and from the parliamentary recordings available on the parliaments' YouTube channels. An instance is a transcript sentence with the corresponding metadata and the aligned audio.

This version of the ParlaSpeech corpora does not release the audio files, as it covers the same data as the preceding versions, i.e. version 2.0 for HR (http://hdl.handle.net/11356/1914) and version 1.0 for RS (http://hdl.handle.net/11356/1834), PL (http://hdl.handle.net/11356/1686), and CZ (http://hdl.handle.net/11356/1785). The main extension in this version consists of five enrichment layers:

* ParlaSpeech-Pause: automatic annotations of filled pauses ("eerm")
* ParlaSpeech-Align: precise word- and grapheme-level alignment (HR, RS only)
* ParlaSpeech-Stress: labelled primary stress in multisyllabic words (HR, RS only)
* ParlaSpeech-Ling: linguistic annotations in the Universal Dependencies (UD) format (lemma, part-of-speech, syntax, etc.)
* ParlaSpeech-Senti: sentiment estimation based on the transcript

Data size per parliament is as follows:

* Croatia (HR): 923k sentences, 3k hours, 324k filled pauses, 11M word stresses
* Serbia (RS): 291k sentences, 900 hours, 74k filled pauses, 2M word stresses
* Czechia (CZ): 718k sentences, 1.2k hours, 200k filled pauses, no word stresses
* Poland (PL): 535k sentences, 1k hours, 200k filled pauses, no word stresses

The data are available in the following formats:

* JSONL: master format, containing all the data. Distributed as newline-delimited JSON, where each line is a valid JSON serialization. Mostly intended for programmatic processing.
* VERT: vertical format intended for concordancers with text, links to audio, linguistic annotations, sentiment, filled pauses, and primary word stress (where available).
* TextGrid (HR and RS only): word- and grapheme-level alignment, primary word stress, and filled pauses. This format is intended for use with the Praat software (https://www.fon.hum.uva.nl/praat/) for research and applications in phonetics and other speech-focused disciplines.
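
Since each line of the JSONL format is a standalone JSON serialization, the files can be streamed one record at a time without loading a whole corpus into memory. The following is a minimal sketch in Python, not an official loader: the function name `iter_jsonl` is hypothetical, and the handling of a `.gz` suffix assumes the files may be distributed gzip-compressed. No field names are assumed here; the actual schema is documented on the dataset website.

```python
import gzip
import json
from typing import Any, Iterator


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file.

    Hypothetical helper for illustration; not part of the dataset release.
    """
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

A caller would iterate over the generator, for instance `for record in iter_jsonl("corpus.jsonl.gz"): ...`, where `corpus.jsonl.gz` is a placeholder file name. Inspecting the keys of the first record is a quick way to check the data against the published schema.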

For a detailed dataset schema description and examples, please see our dedicated website: https://clarinsi.github.io/parlaspeech/.

Identifier
PID	http://hdl.handle.net/11356/1833
Related Identifier	https://doi.org/10.1007/978-3-031-77961-9_10
Related Identifier	https://clarinsi.github.io/parlaspeech/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1833

Provenance
Creator	Ljubešić, Nikola; Rupnik, Peter; Porupski, Ivan; Kuzman Pungeršek, Taja; Koržinek, Danijel; Kopp, Matyáš
Publisher	Jožef Stefan Institute
Publication Year	2025
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Croatian; Serbian; Polish; Czech
Resource Type	corpus
Format	text/plain; charset=utf-8; application/gzip; application/octet-stream; downloadable_files_count: 10
Discipline	Linguistics