Spoken corpora of parliamentary debates ParlaSpeech 3.0

The ParlaSpeech corpora are built from transcripts of the proceedings of the Croatian, Serbian, Polish, and Czech parliaments, available in the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859), and from the recordings available on the parliaments' YouTube channels. An instance is a transcript sentence with its corresponding metadata and aligned audio.

This version of the ParlaSpeech corpora does not release the audio files, as it covers the same data as the preceding versions, i.e. version 2.0 for HR (http://hdl.handle.net/11356/1914) and version 1.0 for RS (http://hdl.handle.net/11356/1834), PL (http://hdl.handle.net/11356/1686), and CZ (http://hdl.handle.net/11356/1785). The main extension in this version consists of five enrichment layers:

* ParlaSpeech-Pause: automatic annotations of filled pauses ("eerm")
* ParlaSpeech-Align: precise word- and grapheme-level alignment (HR, RS only)
* ParlaSpeech-Stress: labelled primary stress in multisyllabic words (HR, RS only)
* ParlaSpeech-Ling: Universal Dependencies (UD) formatted linguistic annotations (lemma, part-of-speech, syntax, etc.)
* ParlaSpeech-Senti: sentiment estimation based on the transcript

The data size per parliament is as follows:

* Croatia (HR): 923k sentences, 3k hours, 324k filled pauses, 11M word stresses
* Serbia (RS): 291k sentences, 900 hours, 74k filled pauses, 2M word stresses
* Czechia (CZ): 718k sentences, 1.2k hours, 200k filled pauses, no word stresses
* Poland (PL): 535k sentences, 1k hours, 200k filled pauses, no word stresses

The data are available in the following formats:

* JSONL: master format, containing all the data. Distributed as newline-delimited JSON, where each line is a valid JSON document. Intended mainly for programmatic processing.
* VERT: vertical format intended for concordancers with text, links to audio, linguistic annotations, sentiment, filled pauses, and primary word stress (where available).
* TextGrid (HR and RS only): word- and grapheme-level alignment, primary word stress, and filled pauses. This format is intended for use with the Praat software (https://www.fon.hum.uva.nl/praat/) for research and applications in phonetics and other speech-focused disciplines.
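
Because the JSONL master format holds one JSON document per line, it can be streamed record by record without loading a whole file into memory. The following is a minimal Python sketch of that idea, not an official loader. The function names `iter_jsonl` and `key_counts` are hypothetical, and the sketch assumes that compressed files carry a `.gz` suffix. It deliberately makes no assumption about field names; for the actual schema, consult the dataset website.

```python
import gzip
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    p = Path(path)
    # Assumption: compressed files end in .gz; anything else is read as plain text.
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def key_counts(path: str, limit: int = 1000) -> Counter:
    """Count top-level keys over the first `limit` records to inspect the schema."""
    counts: Counter = Counter()
    for i, record in enumerate(iter_jsonl(path)):
        if i >= limit:
            break
        counts.update(record.keys())
    return counts
```

Calling `key_counts("path/to/corpus.jsonl.gz")` on a downloaded file (the path here is a placeholder) gives a quick overview of which top-level fields are present before writing any field-specific processing.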

For a detailed dataset schema description and examples, please see our dedicated website: https://clarinsi.github.io/parlaspeech/.

Identifier
PID http://hdl.handle.net/11356/1833
Related Identifier https://doi.org/10.1007/978-3-031-77961-9_10
Related Identifier https://clarinsi.github.io/parlaspeech/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1833
Provenance
Creator Ljubešić, Nikola; Rupnik, Peter; Porupski, Ivan; Kuzman Pungeršek, Taja; Koržinek, Danijel; Kopp, Matyáš
Publisher Jožef Stefan Institute
Publication Year 2025
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Croatian; Serbian; Polish; Czech
Resource Type corpus
Format text/plain; charset=utf-8; application/gzip; application/octet-stream; downloadable_files_count: 10
Discipline Linguistics