ORAL2008: Balanced corpus of informal spoken Czech

Dataset

A balanced corpus of informal spoken Czech with a size of 1 million words (1 MW). It contains transcriptions of 297 recordings made between 2002 and 2007 across the whole of Bohemia. All recordings were made in informal situations to capture prototypically spontaneous spoken language: a private setting, speakers who are physically present and know each other, unscripted speech, and no topic given in advance. There are 995 speakers in total, and the corpus is balanced across their main sociolinguistic categories (gender, age group, education, region of childhood residence).

The corpus is provided in a (semi-XML) vertical format that serves as input to the Manatee query engine. The data therefore correspond exactly to the corpus available through the query interface to registered users of the Czech National Corpus (CNC).
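
As a rough illustration of what reading a vertical file can look like, the sketch below assumes the usual Manatee conventions: one token per line, tab-separated positional attributes, and structural tags on lines of their own. The element names (`doc`, `sp`), their attributes, and the sample tokens are hypothetical and are not taken from the ORAL2008 data; the actual structures should be checked against the corpus documentation.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple, Union

# A structural line looks like <name attr="value" ...>, </name> or <name/>.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w.-]*)((?:\s+[\w.-]+="[^"]*")*)\s*/?>$')
ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')

Event = Tuple[str, Union[str, List[str]], Dict[str, str]]


def parse_vertical(lines: Iterable[str]) -> Iterator[Event]:
    """Yield ("open", name, attrs), ("close", name, {}) or ("token", columns, {})."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            closing, name, attrs = match.groups()
            if closing:
                yield ("close", name, {})
            else:
                yield ("open", name, dict(ATTR_RE.findall(attrs)))
        else:
            # Anything that is not a structural tag is a token line.
            yield ("token", line.split("\t"), {})


def tokens_per_element(events: Iterable[Event], element: str) -> List[int]:
    """Count the token lines inside each occurrence of the given element."""
    counts: List[int] = []
    inside = False
    for kind, value, _ in events:
        if kind == "open" and value == element:
            counts.append(0)
            inside = True
        elif kind == "close" and value == element:
            inside = False
        elif kind == "token" and inside:
            counts[-1] += 1
    return counts


# Hypothetical sample, NOT real corpus content.
SAMPLE = """<doc id="hypothetical-01">
<sp num="1">
no
to
</sp>
<sp num="2">
jo
</sp>
</doc>
"""

if __name__ == "__main__":
    print(tokens_per_element(parse_vertical(SAMPLE.splitlines()), "sp"))
```

For the downloaded archive, the same function could be fed from `gzip.open(path, "rt", encoding="utf-8")`; both the file path and the UTF-8 encoding are assumptions that this record does not state.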

Identifier
PID	http://hdl.handle.net/11858/00-097C-0000-0023-119D-A
Related Identifier	https://wiki.korpus.cz/doku.php/cnk:oral2008
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-119D-A
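
The Metadata Access URL above is an OAI-PMH `GetRecord` request for the `oai_dc` (Dublin Core) format. A minimal sketch of how that URL is composed from the handle, and of how Dublin Core fields might be pulled from a response, follows. The helper names `get_record_url` and `dc_fields` are made up for illustration, and nothing here contacts the server, so the current availability of the endpoint is not verified.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode

# Endpoint and handle are copied from the Identifier fields of this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
HANDLE = "11858/00-097C-0000-0023-119D-A"

# Standard Dublin Core element namespace, in ElementTree's {uri}tag notation.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def get_record_url(handle: str, prefix: str = "oai_dc") -> str:
    """Build the OAI-PMH GetRecord URL for a LINDAT handle (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": prefix,
        "identifier": f"oai:lindat.mff.cuni.cz:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL listed above.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":/")


def dc_fields(xml_text: str) -> Dict[str, List[str]]:
    """Collect non-empty Dublin Core elements from an XML response body."""
    fields: Dict[str, List[str]] = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(DC_NS) and element.text and element.text.strip():
            fields.setdefault(element.tag[len(DC_NS):], []).append(element.text.strip())
    return fields


if __name__ == "__main__":
    print(get_record_url(HANDLE))
```

The response body could then be fetched with any HTTP client and passed to `dc_fields` as text.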

Provenance
Creator	Waclawičová, Martina; Kopřivová, Marie; Křen, Michal; Válková, Lucie
Publisher	Faculty of Arts, Institute of the Czech National Corpus, Charles University in Prague
Publication Year	2008
Rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0); http://creativecommons.org/licenses/by-nc-sa/3.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Czech
Resource Type	corpus
Format	application/x-gzip; application/octet-stream; downloadable_files_count: 1
Discipline	Linguistics