Parliamentary spoken corpus of Polish ParlaSpeech-PL 1.0

Dataset

The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings in the Polish part of the ParlaMint corpus and from the parliamentary recordings available on the Polish Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which makes it simple to split long sentences into shorter segments for automatic speech recognition (ASR) and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All speaker information from the ParlaMint corpus is available via the "speaker_info" key.
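The splitting of long sentences mentioned above can be sketched as follows. This is only an illustration under assumptions: the record does not document the alignment schema, so the per-word keys "word", "start" and "end" (times in seconds), the function names, and the sample data are all hypothetical and would need to be adapted to the actual files.

```python
# Minimal sketch: greedily split one aligned sentence into shorter chunks.
# ASSUMPTION: each word is a dict with hypothetical keys "word", "start", "end"
# (seconds). The real ParlaSpeech-PL field names may differ.

def split_by_duration(words, max_seconds):
    """Group consecutive aligned words into chunks spanning at most max_seconds."""
    chunks = []
    current = []
    for w in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and w["end"] - current[0]["start"] > max_seconds:
            chunks.append(current)
            current = []
        current.append(w)
    if current:
        chunks.append(current)
    return chunks


def chunk_span(chunk):
    """Return (start, end, text) for a chunk of aligned words."""
    text = " ".join(w["word"] for w in chunk)
    return chunk[0]["start"], chunk[-1]["end"], text


# Invented sample alignment, for illustration only.
words = [
    {"word": "Wysoka", "start": 0.0, "end": 0.5},
    {"word": "Izbo", "start": 0.5, "end": 1.0},
    {"word": "dziękuję", "start": 1.2, "end": 2.0},
    {"word": "bardzo", "start": 2.0, "end": 2.6},
]

spans = [chunk_span(c) for c in split_by_duration(words, max_seconds=1.5)]
for start, end, text in spans:
    print(f"{start:.1f}-{end:.1f}s: {text}")
```

The resulting start and end times could then be used to cut the corresponding audio. A single word longer than the limit is kept as its own chunk rather than being dropped.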

Identifier
PID	http://hdl.handle.net/11356/1686
Related Identifier	https://aclanthology.org/2022.parlaclarin-1.16
Related Identifier	http://hdl.handle.net/11356/1679
Related Identifier	https://www.clarin.eu/parlamint
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1686

Provenance
Creator	Koržinek, Danijel; Ljubešić, Nikola
Publisher	Jožef Stefan Institute
Publication Year	2024
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Polish
Resource Type	corpus
Format	text/plain; charset=utf-8; application/gzip; application/octet-stream; text/plain; downloadable_files_count: 4
Discipline	Linguistics