Prague Czech-English Dependency Treebank 2.0

The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus of over 1.2 million running words in almost 50,000 sentences for each language part.

Data

The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release.
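The file structure described above can be sketched as follows. This is only an illustration: the `wsj_SSNN` document naming is assumed from the Penn Treebank convention, the function name is hypothetical, and not every generated ID corresponds to a real document, since only 2312 documents are part of the release.

```python
# Counts taken from the description above.
SECTIONS = 25                  # Penn Treebank-like sections
MAX_FILES_PER_SECTION = 100    # "up to one hundred files" per section
TRANSLATED_DOCUMENTS = 2312    # documents with both POS and structural annotation


def candidate_document_ids():
    """Yield every ID the layout could hold (assumed wsj_SSNN naming)."""
    for section in range(SECTIONS):
        for index in range(MAX_FILES_PER_SECTION):
            yield f"wsj_{section:02d}{index:02d}"


ids = list(candidate_document_ids())
upper_bound = len(ids)

# The layout allows at most 25 * 100 slots; the release fills 2312 of them.
print(f"{TRANSLATED_DOCUMENTS} of at most {upper_bound} document slots are used")
```

The gap between the upper bound and the released count reflects sections with fewer than one hundred files and documents lacking one of the two annotation types.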

Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are: dependency structure of the content words and of coordinating and similar structures (function words are attached as their attribute values); semantic labeling of content words and types of coordinating structures; argument structure, including an argument structure ("valency") lexicon for both languages; and ellipsis and anaphora resolution. This annotation style is called tectogrammatical annotation, and it constitutes the tectogrammatical layer of the corpus. For more details, see below and the documentation.

Annotation of the Czech part

Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer.

Annotation of the English part

The manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using additional linguistic information from the following sources: PropBank (LDC2004T14); VerbNet; NomBank (LDC2008T23); and flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran). For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation.

Identifier
PID	http://hdl.handle.net/11858/00-097C-0000-0015-8DAF-4
Related Identifier	http://hdl.handle.net/11234/1-1664
Related Identifier	http://ufal.mff.cuni.cz/pcedt2.0
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11858/00-097C-0000-0015-8DAF-4
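The Metadata Access entry above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from its parts; the function name `build_get_record_url` is hypothetical, and the code only builds the string without contacting the repository.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Values copied from the Identifier fields above.
BASE_URL = "http://lindat.mff.cuni.cz/repository/oai/request"
HANDLE = "11858/00-097C-0000-0015-8DAF-4"


def build_get_record_url(handle, metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord URL for the given handle (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:lindat.mff.cuni.cz:{handle}",
    }
    # safe=":/" keeps the colons and slashes of the identifier unescaped,
    # so the result matches the URL as listed in the record.
    return f"{BASE_URL}?{urlencode(params, safe=':/')}"


url = build_get_record_url(HANDLE)
query = parse_qs(urlsplit(url).query)
print(url)
```

Parsing the query back with `parse_qs` is a quick way to check that the three request parameters survived the round trip.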

Provenance
Creator	Hajič, Jan; Hajičová, Eva; Panevová, Jarmila; Sgall, Petr; Cinková, Silvie; Fučíková, Eva; Mikulová, Marie; Pajas, Petr; Popelka, Jan; Semecký, Jiří; Šindlerová, Jana; Štěpánek, Jan; Toman, Josef; Urešová, Zdeňka; Žabokrtský, Zdeněk
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2012
Funding Reference	info:eu-repo/grantAgreement/EC/FP7/231720; info:eu-repo/grantAgreement/EC/FP7/247762
Rights	CC-BY-NC-SA + LDC99T42; https://lindat.mff.cuni.cz/repository/xmlui/page/license-pcedt2; RES
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Czech; English
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics