Dataset for primary stress identification in Croatian and related languages and dialects

Dataset

The dataset contains recordings and offset annotations of a sample of the Croatian parliamentary recordings from the corpus ParlaSpeech-HR. It contains training and testing data for primary stress identification from the speech signal on the level of a single word. Additional test datasets are available for three further languages / dialects: Slovenian, the Chakavian dialect of Croatian, and Serbian.

The data is split into four sections based on their provenance:
* ParlaStress-HR.jsonl - Croatian train and test datasets, sampled from the ParlaSpeech-HR 2.0 (http://hdl.handle.net/11356/1914)
* ParlaStress-SR.jsonl - Serbian test dataset, sampled from the ParlaSpeech-RS (http://hdl.handle.net/11356/1834)
* MićiPrinc-CKM.jsonl - Chakavian test dataset, sampled from the Mići Princ dataset (http://hdl.handle.net/11356/1765)
* Artur-SL.jsonl - Slovenian test dataset, sampled from the Artur dataset (http://hdl.handle.net/11356/1776)

All JSONL files have the following attributes:
* id: string
* audio_wav: string, path to the audio file
* audio_start, audio_end: float, seconds of the start and end times in the original audio file, useful for calculating sample duration, as well as reference to original audio
* multisyllabic_words: a list of dictionaries, each entry corresponding to one multisyllabic word with stress information, with keys:
  * word: string, word in question
  * time_s: float, start of word in seconds from the start of the recording
  * time_e: float, end of word in seconds from the start of the recording
  * syllable_count: int, number of syllables in the word
  * stress: a list with a single dictionary (for consistency with unstress) describing the stressed vowel, with keys:
    * vowel: string, character of the word that is stressed
    * time_s: float, vowel start in seconds from the start of the word
    * time_e: float, vowel end in seconds from the start of the word
    * char_idx: int, index of stressed character in the word
  * unstress: same as stress, but for unstressed vowels
* graphalign_intervals: a list of dictionaries describing time alignment of individual graphemes / phonemes, with keys:
  * label: string, character that is being aligned
  * time_s: float, character start in seconds from the start of the word
  * time_e: float, character end in seconds from the start of the word
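The attribute layout above can be read with the Python standard library alone. The following is a minimal sketch, not code shipped with the dataset: the function names `iter_stressed_words` and `sample_duration` are hypothetical, and the example record is invented for illustration (its id, path, timings, and the assumption that char_idx is 0-based are not taken from the data).

```python
import json


def sample_duration(record):
    """Duration of the sample in seconds, from audio_start and audio_end."""
    return record["audio_end"] - record["audio_start"]


def iter_stressed_words(jsonl_lines):
    """Yield (id, word, stressed vowel, char_idx, vowel duration) per word.

    jsonl_lines is any iterable of JSONL text lines, for example an open
    file handle on one of the .jsonl files.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for entry in record["multisyllabic_words"]:
            # "stress" is documented as a list holding a single dictionary.
            stress = entry["stress"][0]
            yield (
                record["id"],
                entry["word"],
                stress["vowel"],
                stress["char_idx"],
                stress["time_e"] - stress["time_s"],
            )


# Invented record that follows the documented keys; all values are
# illustrative assumptions, including the 0-based char_idx.
EXAMPLE_RECORD = {
    "id": "example-0001",
    "audio_wav": "audio/example-0001.wav",
    "audio_start": 12.0,
    "audio_end": 15.5,
    "multisyllabic_words": [
        {
            "word": "hvala",
            "time_s": 0.4,
            "time_e": 0.9,
            "syllable_count": 2,
            "stress": [
                {"vowel": "a", "time_s": 0.1, "time_e": 0.2, "char_idx": 2}
            ],
            "unstress": [
                {"vowel": "a", "time_s": 0.35, "time_e": 0.45, "char_idx": 4}
            ],
        }
    ],
    "graphalign_intervals": [],
}

rows = list(iter_stressed_words([json.dumps(EXAMPLE_RECORD)]))
```

With a real file, the same generator could be fed an open handle, e.g. `open("ParlaStress-HR.jsonl", encoding="utf-8")`, assuming the file sits in the working directory.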

In addition, ParlaStress-HR.jsonl also has the attribute "split_speaker" that assigns each instance to either the "train" or the "test" split. These splits ensure that no speaker appears in both the training and the testing section.
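Separating the Croatian data by this attribute might look as follows. This is a sketch under assumptions: the helper name `split_by_speaker` is hypothetical, and the three short records are invented stand-ins that carry only the keys needed here.

```python
import json
from collections import defaultdict


def split_by_speaker(jsonl_lines, key="split_speaker"):
    """Group parsed records by the value of their split attribute."""
    splits = defaultdict(list)
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        splits[record[key]].append(record)
    return dict(splits)


# Invented stand-in records; real records carry the full attribute set.
example_lines = [
    json.dumps({"id": "a", "split_speaker": "train"}),
    json.dumps({"id": "b", "split_speaker": "test"}),
    json.dumps({"id": "c", "split_speaker": "train"}),
]

splits = split_by_speaker(example_lines)
train_ids = [record["id"] for record in splits["train"]]
test_ids = [record["id"] for record in splits["test"]]
```

Grouping by the attribute value, rather than hard-coding "train" and "test", keeps the helper usable if a file contains only one of the two values.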

Identifier
PID	http://hdl.handle.net/11356/2038
Related Identifier	https://doi.org/10.48550/arXiv.2505.24571
Related Identifier	https://clarinsi.github.io/parlaspeech/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2038

Provenance
Creator	Ljubešić, Nikola; Rupnik, Peter; Porupski, Ivan; Robida, Nejc; Potočnjak, Mirna
Publisher	Jožef Stefan Institute
Publication Year	2025
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Croatian; Slovenian; Slovene; Serbian
Resource Type	corpus
Format	application/zip; text/plain; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics