Training corpus of spoken Slovenian ROG 1.1

Training corpus of spoken Slovenian ROG 1.1 is an improved version of the ROG 1.0 corpus (http://hdl.handle.net/11356/1992). The main differences between the original and the current version are:

  - manually corrected prosodic unit annotations in ROG-Art,
  - release of ROG-Art in ISO TEI format,
  - omission of TextGrid files.

The current version preserves the extent of the data and its composition:

  1. ROG-SST, which includes selected Gos 2.1 (http://hdl.handle.net/11356/1863) transcriptions with:
     - manually assigned lemmas and morphosyntactic tags according to the MULTEXT-East annotation scheme (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html),
     - manual annotations according to the Universal Dependencies annotation scheme (i.e. part-of-speech categories, morphological features and syntactic dependencies).

In total, ROG-SST spans 76341 words and 6108 sentences. ROG-SST is distributed in CoNLL-U format (2014-2024) (.conllu files). Project website: https://spot.ff.uni-lj.si/en/.
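The word and sentence totals of a .conllu file can be recomputed with a few lines of Python. The sketch below is only an illustration and assumes the usual CoNLL-U layout: comment lines start with `#`, sentences are separated by blank lines, and a word is a token line whose ID is a plain integer (multiword ranges such as `1-2` and empty nodes such as `1.1` are skipped). Whether the published totals follow exactly this counting convention is an assumption, and the function name `count_conllu` and the sample sentences are invented for the example.

```python
from typing import Iterable, Tuple


def count_conllu(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences, words) for an iterable of CoNLL-U lines."""
    sentences = 0
    words = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the sentence that is currently open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are not tokens.
            continue
        token_id = line.split("\t", 1)[0]
        in_sentence = True
        # Count only plain integer IDs; "1-2" and "1.1" are not words.
        if token_id.isdigit():
            words += 1
    if in_sentence:
        # The file ended without a trailing blank line.
        sentences += 1
    return sentences, words


# Invented two-sentence sample, not taken from the corpus.
SAMPLE = [
    "# sent_id = demo.1",
    "1\tto\tta\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_",
    "3\ttest\ttest\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
    "# sent_id = demo.2",
    "1\tja\tja\tPART\t_\t_\t0\troot\t_\t_",
    "",
]

if __name__ == "__main__":
    print(count_conllu(SAMPLE))
```

For a real file, the same function can be fed an open file handle, for example `count_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the distributed .conllu files.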

  2. ROG-Art, which includes:
     - all the annotation layers from ROG-SST,
     - prosodic unit annotations,
     - disfluency annotations,
     - dialogue act annotations.

ROG-Art is distributed as:

  - EXMARaLDA format (.EXB files) for viewing with Partitur Editor (https://www.exmaralda.org/),
  - .EXS files and the Rog-Art.coma file for searching through the annotated corpus in the EXMARaLDA EXAKT concordancer (https://www.exmaralda.org/),
  - .TRS files for viewing the transcriptions without annotations with Transcriber (https://trans.sourceforge.net/en/presentation.php),
  - ISO TEI files for cross-platform compatibility.

ROG-Art consists of 39001 words in 1969 sentences. WAV files are available only for the ROG-Art part. They must be copied to the WAV folder of the ROG-Art folder structure to enable automatic opening of the WAV files in the EXMARaLDA or Transcriber tools. The WAV recordings are single-channel, sampled at 44100 Hz with 16-bit precision.
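After copying the recordings, the stated audio parameters can be checked with Python's standard `wave` module. The following is a minimal sketch, not part of the corpus tooling: the names `describe_wav`, `matches_expected` and `silent_wav` are invented for the example, and `silent_wav` only builds an in-memory test signal so that the sketch runs without access to the actual recordings.

```python
import io
import wave

# Parameters stated in the corpus description.
EXPECTED = {"channels": 1, "sample_rate_hz": 44100, "bits_per_sample": 16}


def describe_wav(source):
    """Read the header of a WAV file given as a path string or binary file object."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
        }


def matches_expected(source):
    """True if the file is single-channel, 44100 Hz, 16-bit."""
    return describe_wav(source) == EXPECTED


def silent_wav(channels=1, rate=44100, sample_width=2, frames=100):
    """Build a short silent WAV in memory (demo helper only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setframerate(rate)
        wav.setsampwidth(sample_width)
        wav.writeframes(b"\x00" * frames * channels * sample_width)
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    print(describe_wav(silent_wav()))
    print(matches_expected(silent_wav(channels=2)))
```

For a real recording, pass its path as a string, for example `matches_expected("WAV/example.wav")`, where the file name is a placeholder.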

Identifier
PID http://hdl.handle.net/11356/2062
Related Identifier https://doi.org/10.5281/zenodo.13936426
Related Identifier http://hdl.handle.net/11356/1992
Related Identifier https://mezzanine.um.si/en/mezzanine-english/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2062
Provenance
Creator Verdonik, Darinka; Dobrovoljc, Kaja; Rupnik, Peter; Ljubešić, Nikola; Majhenič, Simona; Čibej, Jaka; Schmidt, Thomas; Vidinić, Jasna
Publisher Faculty of Electrical Engineering and Computer Science, University of Maribor; Jožef Stefan Institute; Faculty of Arts, University of Ljubljana
Publication Year 2026
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics