The spoken corpus Gos 2.1 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand utterances and 1,500 texts, with added word-level temporal information, where available.
Gos 2.1 is composed of three sources:
(1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), 112 hours, 1 million words
(2) Spoken corpus Gos VideoLectures 4.2 (http://hdl.handle.net/11356/1444), 22 hours, 179,000 words
(3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), 185 hours, 1.2 million words, including:
(3a) Artur-J-Splosni, 62 hours, 422,000 words: transcriptions of media recordings, online recordings of conferences, workshops, education videos, etc.
(3b) Artur-N-Prosti, 61 hours, 324,000 words: transcriptions of monologues and dialogues between two persons, recorded for the purposes of the Artur database. Speakers were asked to converse freely or to talk freely about casual topics.
(3c) Artur-P-SejeDZ, 62 hours, 450,000 words: a selection of transcriptions of speech from the Slovene National Assembly. The maximum length of speech by a single speaker is 4,000 words.
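As a quick consistency check on the composition listed above, the following minimal Python sketch sums the hours and word counts exactly as stated in this description. The figures are rounded in the description itself, so the sums can only be expected to approximate the headline totals (about 300 hours, 2.4 million words); the variable and function names are illustrative and not part of the corpus distribution.

```python
# Figures copied from the description above: (hours, words).
# They are rounded in the source, so sums are approximate.
sources = {
    "Gos 1.1": (112, 1_000_000),
    "Gos VideoLectures 4.2": (22, 179_000),
    "ARTUR 1.0 selection": (185, 1_200_000),
}
artur_parts = {
    "Artur-J-Splosni": (62, 422_000),
    "Artur-N-Prosti": (61, 324_000),
    "Artur-P-SejeDZ": (62, 450_000),
}


def totals(parts):
    """Return (total hours, total words) for a dict of (hours, words) pairs."""
    hours = sum(h for h, _ in parts.values())
    words = sum(w for _, w in parts.values())
    return hours, words


corpus_hours, corpus_words = totals(sources)
artur_hours, artur_words = totals(artur_parts)

print(f"All sources: {corpus_hours} h, {corpus_words:,} words")
print(f"Artur parts: {artur_hours} h, {artur_words:,} words")
```

The three Artur parts add up to the 185 hours given for the ARTUR selection, and their word counts add up to just under the rounded 1.2 million.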
All transcriptions are manual and made in two modes:
- pronunciation-based or citation-phonemic transcriptions (containing the output phoneme string derived from the orthographic form by letter-to-sound rules)
- standardised or expanded orthographic transcriptions (the standard Slovene spelling is used to indicate the spoken words, but there are additional rules and word-lists for non-standard lexis).
Part-of-speech tagging with MULTEXT-East morphosyntactic descriptions and lemmatisation were performed automatically with CLASSLA (https://github.com/clarinsi/classla).
In comparison with the preceding version, Gos 2.0, this version implements the following changes:
(1) Unification of casing and normalization decisions between Gos1.0, GosVL, and Artur subcorpora;
(2) Re-introduction of punctuation into the Artur subcorpus;
(3) Addition of word-level temporal information in the form of a separate timeline element;
(4) Re-encoding of the Gos and Gos VideoLectures subcorpora so that the encoding is uniform across all subcorpora.
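To illustrate how word-level temporal information stored in a separate timeline element might be read, here is a minimal Python sketch. It assumes the general TEI convention of a timeline element containing when elements with xml:id, interval and since attributes; the sample fragment, the identifiers T0 to T2 and the function name read_timeline are hypothetical, and the actual layout in the Gos 2.1 TEI files may differ and should be checked against the distributed files.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment following the general TEI timeline convention;
# not copied from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <timeline unit="s" origin="#T0">
    <when xml:id="T0"/>
    <when xml:id="T1" interval="0.42" since="#T0"/>
    <when xml:id="T2" interval="0.97" since="#T0"/>
  </timeline>
</TEI>"""


def read_timeline(xml_text):
    """Map each anchor id to its offset from the timeline origin.

    Assumes every 'since' attribute points at the origin, so the
    'interval' value can be used directly as the offset.
    """
    root = ET.fromstring(xml_text)
    offsets = {}
    for when in root.iter(f"{{{TEI_NS}}}when"):
        offsets[when.get(XML_ID)] = float(when.get("interval", "0"))
    return offsets


print(read_timeline(SAMPLE))
```

Words in the transcription would then refer to such anchors by identifier, which keeps the timing data separate from the text itself.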
The corpus is distributed in the source TRS (Transcriber) format and in the canonical TEI encoding (both XML), as well as in TEI-derived plain-text (TXT) and vertical (vert) formats.
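For readers unfamiliar with the vertical format, the following Python sketch shows one way such a file could be read. It assumes the common convention of one token per line with tab-separated attributes and XML-like structural tags on their own lines. The element name u, the two-column layout and the sample content are assumptions made for illustration only, not the documented Gos 2.1 layout.

```python
# Hypothetical sample: structural tags on their own lines,
# one token per line, tab-separated columns (here: form, lemma).
SAMPLE_VERT = """<u id="u1">
dober\tdober
večer\tvečer
</u>
"""


def read_vert(text):
    """Group token lines into utterances delimited by <u ...> and </u>."""
    utterances = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line == "<u>" or line.startswith("<u "):
            current = []  # start a new utterance
        elif line == "</u>":
            if current is not None:
                utterances.append(current)
            current = None
        elif line.startswith("<"):
            continue  # skip any other structural tag
        elif current is not None:
            current.append(tuple(line.split("\t")))
    return utterances


for utterance in read_vert(SAMPLE_VERT):
    print(" ".join(token[0] for token in utterance))
```

This simple reader treats every line starting with "<" as a structural tag, so it would need adjusting if tokens themselves can begin with that character.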