Spoken Torlak dialect corpus 1.0 (transcription)

Dataset


The Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local population, collected in the field between 2015 and 2017. The semi-structured interviews elicited spontaneous speech in the form of long narratives about traditional culture and history. The corpus is made up of semi-orthographic transcripts of 86.5 hours of recordings from locations evenly distributed across the Timok area of the Torlak dialect zone. The dialect is presently under the influence of the more prestigious Standard Serbian variety and shows a great deal of variation in the use of non-standard features. The corpus contains samples from typical speakers of the dialect with little influence of the standard, as well as a smaller portion of speakers who use both dialect and standard features. The corpus contains 489,021 tokens with accentuation, morphosyntactic tags and lemmatisation. Accentuation was done manually by trained transcribers. Morphosyntactic annotation and lemmatisation (available in the TEI and vertical formats of the corpus) were done automatically, with minor manual corrections. The morphosyntactic tags follow the MULTEXT-East specifications for Torlak, cf. https://github.com/clarinsi/mte-msd.

Identifier
PID	http://hdl.handle.net/11356/1281
Related Identifier	https://doi.org/10.1007/s10579-020-09522-4
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1281
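
The Metadata Access URL above is an OAI-PMH GetRecord request. As a minimal sketch of how such a URL is put together from the handle, the following Python snippet rebuilds it from its parts. The function name `oai_get_record_url` is hypothetical and introduced only for illustration. The base URL, the `oai_dc` prefix and the `oai:www.clarin.si:` identifier scheme are copied from the record, not taken from repository documentation.

```python
from urllib.parse import urlencode

# Base endpoint as it appears in the Metadata Access field of this record.
BASE_URL = "http://www.clarin.si/repository/oai/request"


def oai_get_record_url(handle, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a handle such as '11356/1281'.

    Hypothetical helper: the identifier scheme is inferred from this record.
    """
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL in the record.
    return f"{BASE_URL}?{urlencode(params, safe=':/')}"


url = oai_get_record_url("11356/1281")
print(url)
```

The resulting string can be passed to any HTTP client to retrieve the Dublin Core metadata as XML. No request is made here, so the sketch runs without network access.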

Provenance
Creator	Vuković, Teodora
Publisher	Slavisches Seminar, University of Zurich
Publication Year	2020
Funding Reference	info:eu-repo/grantAgreement/EC/ERA.Net RUS Plus/IZRPZ0_177557
Rights	Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0); PUB; https://creativecommons.org/licenses/by-nc/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Serbian
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 4
Discipline	Linguistics