Parliamentary spoken corpus of Polish ParlaSpeech-PL 1.0

The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings in the Polish part of the ParlaMint corpus and from the parliamentary recordings published on the Polish Parliament's YouTube channel. The corpus consists of audio segments, each corresponding to a specific sentence in the transcripts. The transcripts include word-level alignments to the recordings, so long sentences can easily be split further into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All speaker information from the ParlaMint corpus is available under the "speaker_info" key.
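
The word-level alignments make it possible to cut a long sentence into shorter pieces. The following is a minimal sketch of one way to do that, not the corpus authors' method. The record layout is an assumption: each word is taken to be a dictionary with hypothetical "word", "start" and "end" keys (times in seconds), and the sample values are invented for illustration. The actual field names in the distributed files may differ.

```python
from typing import Dict, List


def split_by_duration(words: List[Dict], max_seconds: float) -> List[Dict]:
    """Greedily group time-aligned words into chunks of at most max_seconds.

    Assumed (hypothetical) word layout: {"word": str, "start": float, "end": float}.
    """
    chunks: List[List[Dict]] = []
    current: List[Dict] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word["end"] - current[0]["start"] > max_seconds:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    # Summarize each chunk by its time span and its text.
    return [
        {
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
            "text": " ".join(w["word"] for w in chunk),
        }
        for chunk in chunks
    ]


# Invented sample alignment, for illustration only.
words = [
    {"word": "Szanowny", "start": 0.0, "end": 0.6},
    {"word": "Panie", "start": 0.6, "end": 1.0},
    {"word": "Marszałku", "start": 1.0, "end": 1.8},
    {"word": "Wysoka", "start": 2.0, "end": 2.5},
    {"word": "Izbo", "start": 2.5, "end": 3.0},
]
segments = split_by_duration(words, max_seconds=2.0)
for segment in segments:
    print(segment)
```

The returned start and end times can be used to cut the corresponding audio. A single word longer than the limit is kept as its own chunk rather than being split.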

Identifier
PID http://hdl.handle.net/11356/1686
Related Identifier https://aclanthology.org/2022.parlaclarin-1.16
Related Identifier http://hdl.handle.net/11356/1679
Related Identifier https://www.clarin.eu/parlamint
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1686
Provenance
Creator Koržinek, Danijel; Ljubešić, Nikola
Publisher Jožef Stefan Institute
Publication Year 2024
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Polish
Resource Type corpus
Format text/plain; charset=utf-8; application/gzip; application/octet-stream; text/plain; downloadable_files_count: 4
Discipline Linguistics