Spoken corpus Gos 2.1 (transcriptions)

Dataset

The spoken corpus Gos 2.1 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand utterances and 1,500 texts, with added word-level temporal information, where available.

Gos 2.1 is composed of three different sources:
(1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), 112 hours, 1 million words
(2) Spoken corpus Gos VideoLectures 4.2 (http://hdl.handle.net/11356/1444), 22 hours, 179,000 words
(3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), 185 hours, 1.2 million words, including:
  (3a) Artur-J-Splosni, 62 hours, 422,000 words: transcriptions of media recordings, online recordings of conferences, workshops, educational videos, etc.
  (3b) Artur-N-Prosti, 61 hours, 324,000 words: transcriptions of monologues and of dialogues between two persons, recorded for the purposes of the Artur database. Speakers were asked to converse freely or to talk freely about casual topics.
  (3c) Artur-P-SejeDZ, 62 hours, 450,000 words: a selection of transcriptions of speech from the Slovene National Assembly. The maximum length of speech by a single speaker is 4,000 words.
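As a quick consistency check, the per-source figures listed above can be summed and compared with the headline totals. The sketch below only uses the rounded numbers given in this description; the names SOURCES, ARTUR_PARTS and totals are illustrative and not part of the corpus distribution. Because the published figures are rounded, the sums are expected to match the stated totals only approximately.

```python
# Figures copied from the description above, as (hours, words).
# All values are rounded in the source, so sums are approximate.
SOURCES = {
    "Gos 1.1": (112, 1_000_000),
    "Gos VideoLectures 4.2": (22, 179_000),
    "ARTUR 1.0 selection": (185, 1_200_000),
}

ARTUR_PARTS = {
    "Artur-J-Splosni": (62, 422_000),
    "Artur-N-Prosti": (61, 324_000),
    "Artur-P-SejeDZ": (62, 450_000),
}


def totals(parts):
    """Return (total hours, total words) for a mapping name -> (hours, words)."""
    hours = sum(h for h, _ in parts.values())
    words = sum(w for _, w in parts.values())
    return hours, words


corpus_hours, corpus_words = totals(SOURCES)
artur_hours, artur_words = totals(ARTUR_PARTS)

print(f"All sources:  {corpus_hours} h, {corpus_words:,} words")
print(f"ARTUR parts:  {artur_hours} h, {artur_words:,} words")
```

The three ARTUR parts add up to the 185 hours stated for the ARTUR selection, and their word counts come close to the rounded 1.2 million. The three sources together give roughly 2.4 million words; their hours sum to somewhat more than the rounded "about 300 hours" quoted for the whole corpus.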

All transcriptions are manual and made in two modes:
- pronunciation-based or citation-phonemic transcriptions (containing the output phoneme string derived from the orthographic form by letter-to-sound rules)
- standardised or expanded orthographic transcriptions (the standard Slovene spelling is used to indicate the spoken words, but there are additional rules and word-lists for non-standard lexis).

Part-of-speech tagging with MULTEXT-East morphosyntactic descriptions and lemmatisation was performed automatically with CLASSLA (https://github.com/clarinsi/classla).

In comparison with the preceding version, Gos 2.0, this version implements the following changes:
(1) Unification of casing and normalization decisions between the Gos1.0, GosVL, and Artur subcorpora;
(2) Re-introduction of punctuation into the Artur subcorpus;
(3) Addition of word-level temporal information in the form of a separate timeline element ();
(4) Re-encoding of the Gos and Gos VideoLectures subcorpora so that the encoding is uniform across all subcorpora.

The corpus is distributed in the source TRS (Transcriber) format and in the canonical TEI encoding (both XML), as well as in TEI-derived plain text (TXT) and vertical (vert) formats.

Identifier
PID	http://hdl.handle.net/11356/1863
Related Identifier	http://hdl.handle.net/11356/1771
Related Identifier	https://viri.cjvt.si/gos/System/About
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1863
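
The Metadata Access address above is an OAI-PMH GetRecord request. A minimal sketch of how such a request URL can be assembled is given below; it only reproduces the parameters visible in the address above. The function name get_record_url and the constant BASE_URL are illustrative assumptions, and the sketch builds the URL without contacting the repository.

```python
from urllib.parse import urlencode

# Endpoint taken from the Metadata Access field of this record.
BASE_URL = "http://www.clarin.si/repository/oai/request"


def get_record_url(handle_suffix, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a handle suffix such as '11356/1863'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle_suffix}",
    }
    # Keep ':' and '/' unescaped so the result matches the address shown above.
    return f"{BASE_URL}?{urlencode(params, safe=':/')}"


url = get_record_url("11356/1863")
print(url)
```

The resulting URL can be passed to any HTTP client to retrieve the Dublin Core (oai_dc) metadata record; whether other metadata prefixes are served by this repository is not stated here.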

Provenance
Creator	Verdonik, Darinka; Zwitter Vitez, Ana; Zemljarič Miklavčič, Jana; Krek, Simon; Stabej, Marko; Erjavec, Tomaž; Potočnik, Tomaž; Sepesy Maučec, Mirjam; Majhenič, Simona; Žgank, Andrej; Bizjak, Andreja; Gril, Lucija; Dobrišek, Simon; Križaj, Janez; Bajec, Marko; Lebar Bajec, Iztok; Jelovšek, Tjaša; Trojar, Mitja; Bernjak, Mitja; Dretnik, Naum; Strle, Gregor; Dobrovoljc, Kaja; Ljubešić, Nikola; Rupnik, Peter
Publisher	Centre for Language Resources and Technologies, University of Ljubljana; Faculty of Electrical Engineering and Computer Science, University of Maribor; Faculty of Electrical Engineering, University of Ljubljana; Faculty of Computer and Information Science, University of Ljubljana; ZRC SAZU; Jožef Stefan Institute
Publication Year	2023
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB; https://creativecommons.org/licenses/by-sa/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 4
Discipline	Linguistics