Spoken Torlak dialect corpus 1.0 (transcription)


The Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local population, collected in the field between 2015 and 2017. The semi-structured interviews elicited spontaneous speech in the form of long narratives about traditional culture and history. The corpus is made up of semi-orthographic transcripts of 86.5 hours of recordings from locations evenly distributed across the Timok area of the Torlak dialect zone. The dialect is presently under the influence of the more prestigious Standard Serbian variety and shows a great deal of variation in the use of non-standard features. The corpus contains samples from typical representatives of the dialect, with little influence from the standard, as well as a smaller group of speakers who use both dialect and standard features. The corpus contains 489,021 tokens with accentuation, morphosyntactic tags and lemmatisation. Accentuation was done manually by trained transcribers. Morphosyntactic annotation and lemmatisation (available in the TEI and vertical formats of the corpus) were done automatically, with minor manual corrections. The morphosyntactic tags follow the MULTEXT-East specifications for Torlak, cf. https://github.com/clarinsi/mte-msd.
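
As a rough illustration of how the vertical format could be read, the sketch below parses a one-token-per-line file into (form, lemma, tag) triples. The column order (form, lemma, morphosyntactic tag), the tab separator, the structural tag names and the sample values are all assumptions made for the example and are not taken from the corpus documentation; check the files in the download before relying on them.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    form: str   # transcribed word form
    lemma: str  # lemma assigned by the automatic lemmatiser
    msd: str    # morphosyntactic tag (MULTEXT-East style)


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from vertical-format lines.

    Assumed layout (hypothetical): one token per line, tab-separated
    columns in the order form, lemma, tag; lines starting with "<" are
    structural markup and are skipped. A literal "<" token would also
    be skipped by this simple check.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            continue  # not a complete token line under the assumed layout
        yield Token(cols[0], cols[1], cols[2])


# Placeholder input, not real corpus data.
sample = [
    '<doc id="hypothetical-001">\n',
    "<s>\n",
    "form1\tlemma1\tNcmsn\n",
    "form2\tlemma2\tVmr3s\n",
    "</s>\n",
    "</doc>\n",
]
tokens = list(read_vertical(sample))
```

For a real file, the same function can be called on an open file handle, for example `read_vertical(open(path, encoding="utf-8"))`, where `path` points to one of the unpacked vertical files.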

PID http://hdl.handle.net/11356/1281
Related Identifier https://doi.org/10.1007/s10579-020-09522-4
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1281
Creator Vuković, Teodora
Publisher Slavisches Seminar, University of Zurich
Publication Year 2020
Funding Reference info:eu-repo/grantAgreement/EC/ERA.Net RUS Plus/IZRPZ0_177557
Rights Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0); PUB; https://creativecommons.org/licenses/by-nc/4.0/
OpenAccess true
Contact info(at)clarin.si
Language Serbian
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline Linguistics