Spoken corpus Gos VideoLectures 4.1 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. It can be used for training continuous speech recognition systems for the Slovene language, for phonetic research, or for any other research on Slovene academic speech. The corpus contains a selection of public lectures available through the web portal Videolectures.net provided by the Jožef Stefan Institute, and comprises 55 lectures totalling 22 hours of speech. This resource contains only the annotated transcriptions of the corpus, while the audio recordings are available at http://hdl.handle.net/11356/1222.

The transcriptions for Gos VideoLectures were done manually and carefully checked. The main guidelines for transcription were those of the Gos corpus (http://www.korpus-gos.net/Support/About). The transcriptions were made with the transcription tool Transcriber 1.5.1 (http://trans.sourceforge.net/en/presentation.php), which can also be used for reading the transcriptions (.trs files) or exporting them to other formats.

The transcriptions comprise the TRS files with tabular metadata, and their conversions to TEI and to the vertical file format (as used, e.g., by Sketch Engine). Each recording has two TRS files, one with the pronunciation-based transcription and the other with the standardised/normalised transcription. The TRS zip also contains files with automatically produced word- and phone-level alignments with the speech signal, as well as the annotation guidelines (in Slovenian).
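As a rough illustration of how a .trs file could be read without Transcriber, the sketch below extracts time-stamped segments using the Python standard library. It assumes the usual Transcriber XML layout (Turn elements with speaker and startTime attributes, containing Sync elements with a time attribute, each followed by the transcribed text). The function name read_trs_segments and the SAMPLE snippet are hypothetical and invented for illustration; they are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Invented miniature example in the assumed Transcriber layout (not corpus data).
SAMPLE = """<Trans version="2">
<Speakers><Speaker id="spk1" name="lecturer"/></Speakers>
<Episode><Section type="report" startTime="0" endTime="5.0">
<Turn speaker="spk1" startTime="0" endTime="5.0">
<Sync time="0"/>
dober dan
<Sync time="2.5"/>
danes bomo <Event desc="cough" type="noise" extent="instantaneous"/> začeli
</Turn>
</Section></Episode></Trans>"""


def read_trs_segments(root):
    """Return (start_time, speaker, text) tuples for each Sync-delimited segment."""
    segments = []
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker", "")
        start = float(turn.get("startTime", "0"))
        parts = [turn.text or ""]
        for child in turn:
            if child.tag == "Sync":
                # A Sync element closes the previous segment and opens a new one.
                text = " ".join("".join(parts).split())
                if text:
                    segments.append((start, speaker, text))
                start = float(child.get("time"))
                parts = []
            # Text after any child (Sync, Event, ...) belongs to the open segment.
            parts.append(child.tail or "")
        text = " ".join("".join(parts).split())
        if text:
            segments.append((start, speaker, text))
    return segments


if __name__ == "__main__":
    for segment in read_trs_segments(ET.fromstring(SAMPLE)):
        print(segment)
```

For a file on disk, the root element can be obtained with ET.parse(path).getroot() instead of ET.fromstring. Non-speech markers such as Event elements are skipped here; only their surrounding text is kept.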

The TEI and vertical encodings join the two transcriptions at the token level, with the normalised words also linguistically annotated. The annotations comprise the word lemma, the MULTEXT-East MSDs and the Universal Dependencies morphological features.
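A minimal sketch of reading the vertical files is given below. It assumes the common Sketch Engine convention of one token per line with tab-separated attributes, and structural tags on lines of their own. The column order used here (pronunciation-based form, normalised form, lemma, MSD) is an assumption made for illustration, as are the function name parse_vertical and the sample lines; the actual column layout should be checked against the files themselves.

```python
# Hypothetical column order; verify against the actual vertical files.
DEFAULT_COLUMNS = ("pron", "norm", "lemma", "msd")

# Invented sample lines in the assumed layout (not corpus data).
SAMPLE_LINES = [
    '<u who="spk1">',
    "dons\tdanes\tdanes\tRgp",
    "bomo\tbomo\tbiti\tVa-f1p-n",
    "</u>",
]


def parse_vertical(lines, columns=DEFAULT_COLUMNS):
    """Return one dict per token line, skipping blank lines and structural tags."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Lines such as <u ...> or </u> mark structures, not tokens.
        if line.startswith("<") and line.endswith(">"):
            continue
        values = line.split("\t")
        # zip stops at the shorter sequence, so extra columns are ignored.
        tokens.append(dict(zip(columns, values)))
    return tokens


if __name__ == "__main__":
    for token in parse_vertical(SAMPLE_LINES):
        print(token["pron"], "->", token["norm"], token["lemma"], token["msd"])
```

Keeping both forms in each token record reflects the token-level join of the two transcriptions, so a pronunciation variant can be looked up alongside its normalised form and annotation.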

As opposed to version 4.0, this version uses the CLASSLA tool (https://github.com/clarinsi/classla) for linguistic annotation and changes the TEI encoding of the normalised words.

Identifier
PID http://hdl.handle.net/11356/1439
Related Identifier http://hdl.handle.net/11356/1223
Related Identifier http://hdl.handle.net/11356/1444
Related Identifier https://www.clarin.si/info/services/projects/#Gos_Videolectures_II
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1439
Provenance
Creator Verdonik, Darinka; Potočnik, Tomaž; Sepesy Maučec, Mirjam; Erjavec, Tomaž; Majhenič, Simona; Žgank, Andrej
Publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
Publication Year 2021
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 3
Discipline Linguistics