SYN v9: large corpus of written Czech

Dataset

Corpus of contemporary written (printed) Czech with a size of 4.7 GW (4.7 billion words, i.e. 5.7 billion tokens). It covers mostly the period 1990-2019 and features rich metadata, including detailed bibliographical information and text-type classification. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but newspapers noticeably predominate. The corpus is lemmatized and morphologically tagged using the new CNC tagset, which was first used to annotate the SYN2020 corpus.

SYN v9 is provided in a CoNLL-U-like vertical format that serves as input to the Manatee query engine. The data thus correspond to the corpus available to registered CNC users via the KonText query interface at http://www.korpus.cz, with one important exception: the corpus is shuffled, i.e. each document is divided into blocks of at most 100 words (respecting sentence boundaries), and the order of these blocks is randomized within the document.
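
The shuffling described above can be illustrated with a minimal Python sketch. It is not the CNC's actual procedure: the function names, the greedy packing strategy, the counting of every token as a word, and the handling of sentences longer than the limit (each is kept as its own oversized block) are all assumptions made for illustration.

```python
import random
from typing import List, Optional

Sentence = List[str]    # one sentence as a list of word forms
Block = List[Sentence]  # one block as a list of whole sentences


def split_into_blocks(sentences: List[Sentence], max_words: int = 100) -> List[Block]:
    """Greedily pack consecutive sentences into blocks of at most max_words.

    Sentences are never split. Assumption: a single sentence longer than
    max_words becomes a block of its own and exceeds the limit.
    """
    blocks: List[Block] = []
    current: Block = []
    count = 0
    for sentence in sentences:
        # Close the current block if this sentence would overflow it.
        if current and count + len(sentence) > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sentence)
        count += len(sentence)
    if current:
        blocks.append(current)
    return blocks


def shuffle_document(sentences: List[Sentence], max_words: int = 100,
                     seed: Optional[int] = None) -> List[Block]:
    """Split one document into blocks and randomize the block order."""
    blocks = split_into_blocks(sentences, max_words)
    random.Random(seed).shuffle(blocks)  # shuffles within this document only
    return blocks
```

Because the shuffle is applied to each document separately, document-level metadata stays attached to its text, while the original order of passages within the document cannot be recovered.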

Identifier
PID	http://hdl.handle.net/11234/1-4635
Related Identifier	http://hdl.handle.net/11234/1-1846
Related Identifier	https://wiki.korpus.cz/doku.php/en:cnk:syn:verze9
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-4635
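
The Metadata Access link is an OAI-PMH GetRecord request. The following Python sketch only takes that URL apart and rebuilds it, without contacting the server. The helper names oai_params and build_oai_url are hypothetical, and the identifier pattern (the prefix oai:lindat.mff.cuni.cz: followed by the handle) is inferred from this single record, so it may not hold for every record.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# URL copied verbatim from the Metadata Access field of this record.
OAI_URL = ("http://lindat.mff.cuni.cz/repository/oai/request"
           "?verb=GetRecord&metadataPrefix=oai_dc"
           "&identifier=oai:lindat.mff.cuni.cz:11234/1-4635")


def oai_params(url: str) -> dict:
    """Return the query parameters of an OAI-PMH request URL as a dict."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


def build_oai_url(base: str, handle: str, prefix: str = "oai_dc") -> str:
    """Rebuild a GetRecord URL from a handle such as '11234/1-4635'.

    Assumption: the identifier follows the pattern seen in this record.
    """
    params = {
        "verb": "GetRecord",
        "metadataPrefix": prefix,
        "identifier": f"oai:lindat.mff.cuni.cz:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the published form.
    return base + "?" + urlencode(params, safe=":/")
```

The oai_dc prefix requests the record in Dublin Core; whether the repository offers other metadata formats is not stated in this record.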

Provenance
Creator	Křen, Michal; Cvrček, Václav; Henyš, Jan; Hnátková, Milena; Jelínek, Tomáš; Kocek, Jan; Kováříková, Dominika; Křivan, Jan; Milička, Jiří; Petkevič, Vladimír; Procházka, Pavel; Skoumalová, Hana; Šindlerová, Jana; Škrabal, Michal
Publisher	Charles University, Faculty of Arts, Institute of the Czech National Corpus
Publication Year	2021
Rights	Czech National Corpus (Shuffled Corpus Data); https://lindat.mff.cuni.cz/repository/static/license-cnc.html; ACA
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Czech
Resource Type	corpus
Format	application/x-xz; application/octet-stream; downloadable_files_count: 1
Discipline	Linguistics