Training corpus of spoken Slovenian ROG 1.0

The Training corpus of spoken Slovenian ROG 1.0 is the main Slovenian-language resource for training and evaluating technologies that process speech or speech transcripts, such as part-of-speech taggers, parsers, prosodic unit segmenters, disfluency identifiers, and dialogue act classifiers. It is also suitable for speech-related research. It consists of two parts:

  1. ROG-SST, which includes selected Gos 2.1 (http://hdl.handle.net/11356/1863) transcriptions with:
     - manually assigned lemmas and morphosyntactic tags according to the MULTEXT-East annotation scheme (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html),
     - manual annotations according to the Universal Dependencies annotation scheme (i.e. part-of-speech categories, morphological features and syntactic dependencies).

In total, ROG-SST spans 76341 words and 6108 sentences. ROG-SST is distributed in CoNLL-U format (2014-2024) (.conllu files). Project website: https://spot.ff.uni-lj.si/en/.

  2. ROG-Art, which includes:
     - all the annotation layers from ROG-SST,
     - prosodic unit annotations,
     - disfluency annotations,
     - dialogue act annotations.
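
As a rough illustration of working with the ROG-SST .conllu files, the following Python sketch counts sentences and words in CoNLL-U content. It assumes the standard CoNLL-U layout (comment lines starting with `#`, tab-separated token lines, a blank line between sentences). The function name `count_conllu` and the two sample sentences are made up for illustration and are not taken from the corpus. Whether such a count reproduces the published totals depends on how the corpus authors counted items such as punctuation, which is not stated here.

```python
from typing import Iterable, Tuple


def count_conllu(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences, words) for CoNLL-U content given as lines."""
    sentences = 0
    words = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the sentence that was being read, if any.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata line
        in_sentence = True
        token_id = line.split("\t", 1)[0]
        # Plain integer IDs are words; IDs such as "1-2" (multiword token
        # ranges) or "1.1" (empty nodes) are not counted.
        if token_id.isdigit():
            words += 1
    if in_sentence:
        sentences += 1  # the last sentence may lack a trailing blank line
    return sentences, words


# Invented sample data, NOT from ROG-SST.
SAMPLE = "\n".join([
    "# sent_id = demo.1",
    "1\tto\tta\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_",
    "3\ttest\ttest\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
    "# sent_id = demo.2",
    "1\thvala\thvala\tINTJ\t_\t_\t0\troot\t_\t_",
])

if __name__ == "__main__":
    print(count_conllu(SAMPLE.splitlines()))
```

For a real file, the same function can be fed an open file handle, for example `count_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the distributed .conllu files.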

ROG-Art is distributed as:

  - EXMARaLDA format (.EXB files) for viewing with Partitur Editor (https://www.exmaralda.org/)
  - .EXS files and Rog-Art.coma file for searching through the annotated corpus in the EXMARaLDA EXAKT concordancer (https://www.exmaralda.org/)
  - .TRS files for viewing the transcriptions without annotations with Transcriber (https://trans.sourceforge.net/en/presentation.php)
  - .TextGrid files with additional prosodic annotations for viewing with Praat (TeG folder, www.praat.org)

ROG-Art consists of 39001 words in 1969 sentences. WAV files are only available for the ROG-Art part. They must be copied to the WAV folder of the ROG-Art folder structure to enable automatic opening of WAV files in the EXMARaLDA or Transcriber tools. The WAV recordings are single-channel, sampled at 44100 Hz with 16-bit precision.
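
Before copying recordings into the WAV folder, it may be useful to confirm that a file has the stated properties (single channel, 44100 Hz, 16 bit). The sketch below does this with Python's standard `wave` module. The function names and the silent demo files are hypothetical and only serve to make the example self-contained; they are not part of the corpus distribution.

```python
import os
import tempfile
import wave

# Properties stated in the corpus description for the ROG-Art recordings.
EXPECTED = {"channels": 1, "sample_rate_hz": 44100, "sample_width_bytes": 2}


def wav_properties(path):
    """Read channel count, sample rate and sample width from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
        }


def matches_expected(path):
    """True if the file is mono, 44100 Hz, 16 bit (2 bytes per sample)."""
    return wav_properties(path) == EXPECTED


def write_silence(path, channels, sample_rate_hz, sample_width_bytes, n_frames=100):
    """Hypothetical helper: write a short silent WAV file for the demo."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * (n_frames * channels * sample_width_bytes))


def demo():
    """Create one conforming and one non-conforming file and check both."""
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "mono_44k.wav")
        bad = os.path.join(tmp, "stereo_16k.wav")
        write_silence(good, 1, 44100, 2)
        write_silence(bad, 2, 16000, 2)
        return matches_expected(good), matches_expected(bad)


if __name__ == "__main__":
    print(demo())
```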

Identifier
PID http://hdl.handle.net/11356/1992
Related Identifier https://doi.org/10.5281/zenodo.13936426
Related Identifier https://mezzanine.um.si/en/mezzanine-english/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1992
Provenance
Creator Verdonik, Darinka; Dobrovoljc, Kaja; Rupnik, Peter; Ljubešić, Nikola; Majhenič, Simona; Čibej, Jaka; Schmidt, Thomas
Publisher Faculty of Electrical Engineering and Computer Science, University of Maribor; Jožef Stefan Institute; Faculty of Arts, University of Ljubljana
Publication Year 2024
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics
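
The Metadata Access URL listed above is an OAI-PMH `GetRecord` request. As a minimal sketch, the URL can be assembled from its three query parameters with Python's standard library. The function name `get_record_url` is an illustrative assumption; the base URL, metadata prefix and identifier are copied from the record above. The sketch only builds the URL and does not contact the repository.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL taken from the Metadata Access field of this record.
BASE_URL = "http://www.clarin.si/repository/oai/request"


def get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for the given record identifier."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # safe=":/" leaves colons and slashes unescaped, so the result is
    # spelled the same way as the URL in the record.
    return BASE_URL + "?" + urlencode(params, safe=":/")


URL = get_record_url("oai:www.clarin.si:11356/1992")

if __name__ == "__main__":
    print(URL)
    print(parse_qs(urlsplit(URL).query))
```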