The Training corpus of spoken Slovenian ROG 1.0 is the main resource for training and evaluating technologies that process Slovenian speech or speech transcripts, such as part-of-speech taggers, parsers, prosodic unit segmenters, disfluency identifiers and dialogue act classifiers. It is also suitable for speech-related research. It consists of two parts:
- ROG-SST, which includes selected Gos 2.1 (http://hdl.handle.net/11356/1863) transcriptions with:
  - manually assigned lemmas and morphosyntactic tags according to the MULTEXT-East annotation scheme (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html),
  - manual annotations according to the Universal Dependencies annotation scheme (i.e. part-of-speech categories, morphological features and syntactic dependencies).
In total, ROG-SST spans 76341 words and 6108 sentences. ROG-SST is distributed in the CoNLL-U format (.conllu files). Project website: https://spot.ff.uni-lj.si/en/.
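As a rough illustration of how the .conllu files can be inspected, the following Python sketch counts sentences and words in CoNLL-U data. It assumes standard CoNLL-U conventions (comment lines starting with `#`, a blank line after each sentence, tab-separated token lines). The folder name passed to `count_corpus` is hypothetical, and the totals may differ from the figures above depending on how the corpus counts words (for example, whether punctuation tokens are included).

```python
from pathlib import Path
from typing import Iterable, Tuple


def count_conllu(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences, words) for an iterable of CoNLL-U lines."""
    sentences = 0
    words = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment (e.g. sent_id, text); not a token.
            continue
        token_id = line.split("\t", 1)[0]
        in_sentence = True
        # Plain integer IDs are syntactic words; multiword-token ranges
        # ("1-2") and empty nodes ("1.1") are not counted.
        if token_id.isdigit():
            words += 1
    if in_sentence:
        # The last sentence may lack a trailing blank line.
        sentences += 1
    return sentences, words


def count_corpus(folder: str) -> Tuple[int, int]:
    """Sum the counts over every .conllu file in a (hypothetical) folder."""
    total_sentences = total_words = 0
    for path in sorted(Path(folder).glob("*.conllu")):
        with path.open(encoding="utf-8") as handle:
            s, w = count_conllu(handle)
        total_sentences += s
        total_words += w
    return total_sentences, total_words
```

Calling `count_corpus("ROG-SST")` on a local copy would return a `(sentences, words)` tuple; the folder name is only a placeholder for wherever the files were unpacked.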
- ROG-Art, which includes:
  - all annotation layers from ROG-SST
  - prosodic unit annotations
  - disfluency annotations
  - dialogue act annotations
ROG-Art is distributed as:
- EXMARaLDA format (.EXB files) for viewing with the Partitur-Editor (https://www.exmaralda.org/)
- .EXS files and the Rog-Art.coma file for searching through the annotated corpus in the EXMARaLDA EXAKT concordancer (https://www.exmaralda.org/)
- .TRS files for viewing the transcriptions without annotations with Transcriber (https://trans.sourceforge.net/en/presentation.php)
- .TextGrid files with additional prosodic annotations for viewing with Praat (TeG folder, www.praat.org)
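For readers who want to inspect the .EXB files outside the EXMARaLDA tools, the sketch below lists the tiers in one file using only the Python standard library. It assumes the usual EXMARaLDA basic-transcription XML layout (`tier` elements containing `event` elements). The tier categories actually used in ROG-Art are not stated here, so the code reports whatever it finds rather than expecting specific names.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Union

TierInfo = Dict[str, Union[Optional[str], int]]


def summarize_tiers(root: ET.Element) -> List[TierInfo]:
    """Describe each <tier> found under an EXB root element."""
    summary: List[TierInfo] = []
    for tier in root.iter("tier"):
        summary.append(
            {
                "id": tier.get("id"),
                "speaker": tier.get("speaker"),
                "category": tier.get("category"),
                "type": tier.get("type"),
                # Each <event> is one annotated interval on the timeline.
                "events": len(tier.findall("event")),
            }
        )
    return summary


def summarize_exb(path: str) -> List[TierInfo]:
    """Parse an .EXB file from disk and summarize its tiers."""
    return summarize_tiers(ET.parse(path).getroot())
```

Running `summarize_exb` on one of the distributed .EXB files should show which tiers carry the transcription and which carry the prosodic unit, disfluency and dialogue act layers.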
ROG-Art consists of 39001 words in 1969 sentences. WAV files are only available for the ROG-Art part. To enable automatic opening of the recordings in the EXMARaLDA or Transcriber tools, copy them to the WAV folder of the ROG-Art folder structure. The WAV recordings are single-channel, sampled at 44100 Hz with 16-bit precision.
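The stated audio format can be checked after copying the recordings. This is a minimal sketch using Python's standard `wave` module; the names `WavFormat`, `read_wav_format` and `matches_expected` are illustrative and not part of the corpus distribution.

```python
import wave
from typing import NamedTuple


class WavFormat(NamedTuple):
    channels: int
    sample_rate_hz: int
    bits_per_sample: int


# Format given in the corpus description: mono, 44100 Hz, 16 bit.
EXPECTED = WavFormat(channels=1, sample_rate_hz=44100, bits_per_sample=16)


def read_wav_format(source) -> WavFormat:
    """Read the header of a WAV file given as a path string or binary file object."""
    with wave.open(source, "rb") as wav:
        return WavFormat(
            channels=wav.getnchannels(),
            sample_rate_hz=wav.getframerate(),
            # getsampwidth() returns bytes per sample.
            bits_per_sample=wav.getsampwidth() * 8,
        )


def matches_expected(source) -> bool:
    """Return True if the file has the format stated in the description."""
    return read_wav_format(source) == EXPECTED
```

A file that fails `matches_expected` may have been resampled or converted at some point, which is worth knowing before relying on time alignments with the annotations.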