The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of the Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859), and from the parliamentary recordings available on the parliaments' YouTube channels. An instance is a transcript sentence with the corresponding metadata and the aligned audio.
This version of the ParlaSpeech corpora does not release the audio files, as it covers the same data as the preceding versions, i.e. version 2.0 for HR (http://hdl.handle.net/11356/1914) and version 1.0 for RS (http://hdl.handle.net/11356/1834), PL (http://hdl.handle.net/11356/1686), and CZ (http://hdl.handle.net/11356/1785). The main extension in this version is a set of five enrichment layers:
* ParlaSpeech-Pause: automatic annotations of filled pauses ("eerm")
* ParlaSpeech-Align: precise word- and grapheme-level alignment (HR, RS only)
* ParlaSpeech-Stress: labelled primary stress in multisyllabic words (HR, RS only)
* ParlaSpeech-Ling: Universal Dependencies (UD) formatted linguistic annotations (lemma, part-of-speech, syntax, etc.)
* ParlaSpeech-Senti: sentiment estimation based on the transcript
The data size per parliament is as follows:
* Croatia (HR): 923k sentences, 3k hours, 324k filled pauses, 11M word stresses
* Serbia (RS): 291k sentences, 900 hours, 74k filled pauses, 2M word stresses
* Czechia (CZ): 718k sentences, 1.2k hours, 200k filled pauses, no word stresses
* Poland (PL): 535k sentences, 1k hours, 200k filled pauses, no word stresses
The data are available in the following formats:
* JSONL: master format, containing all the data. Distributed as newline-delimited JSON, where each line is a valid JSON serialization. Intended mainly for programmatic processing.
* VERT: vertical format intended for concordancers with text, links to audio, linguistic annotations, sentiment, filled pauses, and primary word stress (where available).
* TextGrid (HR and RS only): word- and grapheme-level alignment, primary word stress, and filled pauses. This format is intended for use with the Praat software (https://www.fon.hum.uva.nl/praat/) for research and applications in phonetics and other speech-focused disciplines.
For a detailed dataset schema description and examples, please see our dedicated website: https://clarinsi.github.io/parlaspeech/.
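As a minimal sketch of how the JSONL master format could be processed, the Python snippet below reads a file one line at a time and counts the top-level keys it encounters. It assumes only what is stated above (one valid JSON serialization per line) plus the assumption that each line is a JSON object; the function names and the file name in the usage note are hypothetical, and no field names are assumed, since the actual schema is documented on the website.

```python
import json
from collections import Counter
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def summarize_keys(path: str) -> Counter:
    """Count how often each top-level key occurs across all records.

    Assumption: every line holds a JSON object (not an array or scalar).
    """
    counts = Counter()
    for record in iter_jsonl(path):
        counts.update(record.keys())
    return counts
```

Calling `summarize_keys("ParlaSpeech-HR.jsonl")` (a hypothetical file name) would give a quick overview of which fields are present before consulting the full schema. Reading line by line keeps memory use low, which matters for files with hundreds of thousands of sentences.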