The dataset contains recordings and offset annotations of a sample of the Croatian parliamentary recordings from the ParlaSpeech-HR corpus. It provides training and test data for identifying primary stress from the speech signal at the level of a single word. Additional test datasets are available for three languages / dialects: Slovenian, the Chakavian dialect of Croatian, and Serbian.
The data is split into four files according to their provenance:
* ParlaStress-HR.jsonl - Croatian train and test datasets, sampled from the ParlaSpeech-HR 2.0 (http://hdl.handle.net/11356/1914)
* ParlaStress-SR.jsonl - Serbian test dataset, sampled from the ParlaSpeech-RS (http://hdl.handle.net/11356/1834)
* MićiPrinc-CKM.jsonl - Chakavian test dataset, sampled from the Mići Princ dataset (http://hdl.handle.net/11356/1765)
* Artur-SL.jsonl - Slovenian test dataset, sampled from the Artur dataset (http://hdl.handle.net/11356/1776)
All JSONL files have the following attributes:
* id: string
* audio_wav: string, path to the audio file
* audio_start, audio_end: float, start and end times in seconds within the original audio file; useful for calculating the sample duration and for referring back to the original audio
* multisyllabic_words: a list of dictionaries, each entry corresponding to one multisyllabic word with stress information, with keys:
    * word: string, the word in question
    * time_s: float, start of the word in seconds from the start of the recording
    * time_e: float, end of the word in seconds from the start of the recording
    * syllable_count: int, number of syllables in the word
    * stress: a list with a single dictionary (a list is used for consistency with unstress) describing the stressed vowel, with keys:
        * vowel: string, character of the word that is stressed
        * time_s: float, vowel start in seconds from the start of the word
        * time_e: float, vowel end in seconds from the start of the word
        * char_idx: int, index of the stressed character in the word
    * unstress: same as stress, but for unstressed vowels
* graphalign_intervals: a list of dictionaries describing time alignment of individual graphemes / phonemes, with keys:
    * label: string, character that is being aligned
    * time_s: float, character start in seconds from the start of the word
    * time_e: float, character end in seconds from the start of the word
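The attribute layout above can be illustrated with a minimal Python sketch. It assumes the time offsets behave exactly as described (word times measured from the start of the recording, vowel times measured from the start of the word). The helper names iter_records and stressed_vowel_spans are hypothetical, and the sample record is invented for illustration; its values are not taken from the dataset.

```python
import json


def iter_records(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def stressed_vowel_spans(record):
    """Yield (word, vowel, start, end), with times measured from the recording start."""
    for entry in record["multisyllabic_words"]:
        for stressed in entry["stress"]:
            # Vowel times are described as offsets from the word start,
            # so the word's own start time is added to both ends.
            start = entry["time_s"] + stressed["time_s"]
            end = entry["time_s"] + stressed["time_e"]
            yield entry["word"], stressed["vowel"], start, end


# Invented record for illustration only; char_idx is assumed to be 0-based here.
sample = {
    "id": "example-0",
    "audio_wav": "example.wav",
    "audio_start": 10.0,
    "audio_end": 12.5,
    "multisyllabic_words": [
        {
            "word": "hvala",
            "time_s": 0.5,
            "time_e": 1.0,
            "syllable_count": 2,
            "stress": [{"vowel": "a", "time_s": 0.125, "time_e": 0.25, "char_idx": 2}],
            "unstress": [{"vowel": "a", "time_s": 0.375, "time_e": 0.5, "char_idx": 4}],
        }
    ],
}

spans = list(stressed_vowel_spans(sample))
duration = sample["audio_end"] - sample["audio_start"]
```

For a real file, the same function could be applied to each record produced by iter_records("ParlaStress-HR.jsonl"), assuming the file sits in the working directory.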
In addition, ParlaStress-HR.jsonl has the attribute "split_speaker", which assigns each instance to either the "train" or the "test" split. The splits are speaker-disjoint: no speaker appears in both the training and the test portion.
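A minimal sketch of separating the two splits, assuming each line of ParlaStress-HR.jsonl is a JSON object carrying the "split_speaker" attribute. The function name split_by_speaker is hypothetical, and the two demo lines are invented placeholders rather than real records.

```python
import json
from collections import defaultdict


def split_by_speaker(lines):
    """Group JSONL lines by the value of their "split_speaker" attribute."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        groups[record["split_speaker"]].append(record)
    return dict(groups)


# Invented JSONL lines for illustration; real records carry the full set of attributes.
demo_lines = [
    '{"id": "a", "split_speaker": "train"}',
    '{"id": "b", "split_speaker": "test"}',
]
splits = split_by_speaker(demo_lines)
```

Since the function accepts any iterable of lines, an open file handle can be passed directly, for example `with open("ParlaStress-HR.jsonl", encoding="utf-8") as fh: splits = split_by_speaker(fh)`.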