Coreference in Universal Dependencies 1.4 (CorefUD 1.4)

Dataset

CorefUD is a collection of existing coreference-annotated datasets that have been converted to a unified annotation scheme. The current version (1.4) comprises 33 datasets covering 19 languages. Where manual morphosyntactic annotation is not available or cannot be reliably converted, the datasets are enriched with automatically assigned morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. The data are stored in the CoNLL-U format, with coreference- and bridging-specific information encoded as attribute–value pairs in the MISC column.

The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT-CLARIAH-CZ and contains 29 datasets for 19 languages, without their test portions: 1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 3 for Czech, 1 for Dutch, 4 for English, 3 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Latin, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish. The non-public edition is available internally to ÚFAL members. It includes an additional 4 datasets for 2 languages (1 for Dutch and 3 for English) that cannot be redistributed due to licensing restrictions, and it also contains the test portions of all datasets.

When using any of the harmonized datasets, please review its license (available in the same directory as the data) and cite the original resource.

Compared to version 1.3, version 1.4 introduces new languages and corpora: Czech-PDTSC, Latin-CorefLat, Dutch-OpenBoek, English-FantasyCoref, and French-LitBankFr. The last three consist of long literary documents. In addition, English-GUM, Czech-PCEDT, and Czech-PDT have been updated to newer releases. A detailed list of changes for each dataset is provided in the corresponding README file.
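As a rough illustration of the MISC-column encoding mentioned in the description, the following Python sketch splits the tenth column of a CoNLL-U token line into attribute–value pairs. It relies only on the general CoNLL-U conventions (ten tab-separated columns, `|`-separated `Key=Value` items in MISC, `_` for an empty column). The function names and the sample line, including its `Entity` value, are illustrative assumptions and are not taken from any CorefUD file; the README of each dataset remains the reference for the attributes actually used.

```python
def parse_misc(misc: str) -> dict[str, str]:
    """Split a CoNLL-U MISC value into attribute-value pairs."""
    if misc == "_":  # "_" marks an empty column in CoNLL-U
        return {}
    pairs = {}
    for item in misc.split("|"):
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def misc_of_token_line(line: str) -> dict[str, str]:
    """Return the MISC attributes of one CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return parse_misc(columns[9])  # MISC is the tenth column


# Hypothetical token line; the Entity value is a placeholder, not real data.
sample = "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\tEntity=(e1)|SpaceAfter=No"
attributes = misc_of_token_line(sample)
```

Comment lines (starting with `#`) and blank sentence separators would need to be skipped before calling `misc_of_token_line` on a real file.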

Identifier
PID	http://hdl.handle.net/11234/1-6108
Related Identifier	http://hdl.handle.net/11234/1-5896
Related Identifier	https://ufal.mff.cuni.cz/corefud
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-6108
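The Metadata Access address above is an OAI-PMH `GetRecord` request. As a minimal sketch, assuming only the Python standard library, the same URL can be assembled from its parts, which makes the three query parameters explicit. The variable names are illustrative, and no network request is made here.

```python
from urllib.parse import urlencode

# Base address and parameters copied from the Metadata Access field.
base_url = "http://lindat.mff.cuni.cz/repository/oai/request"
params = {
    "verb": "GetRecord",          # OAI-PMH verb for fetching a single record
    "metadataPrefix": "oai_dc",   # Dublin Core metadata format
    "identifier": "oai:lindat.mff.cuni.cz:11234/1-6108",
}

# Keep ":" and "/" unescaped so the result matches the address as listed.
metadata_url = base_url + "?" + urlencode(params, safe=":/")
```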

Provenance
Creator	Novák, Michal; Popel, Martin; Zeman, Daniel; Žabokrtský, Zdeněk; Nedoluzhko, Anna; Acar, Kutay; Bamman, David; Bourgois, Antoine; Bourgonje, Peter; Cinková, Silvie; Delfino, Eleonora; Eckhoff, Hanne; Cebiroğlu Eryiğit, Gülşen; Hajič, Jan; Han, Sooyoun; Hardmeier, Christian; Haug, Dag; Jørgensen, Tollef; Kåsen, Andre; Krielke, Pauline; Landragin, Frédéric; Lapshinova-Koltunski, Ekaterina; Leotta, Roberta Grazia; Mæhlum, Petter; Martí, M. Antònia; Mélanie-Becquet, Frédérique; Mikulová, Marie; Milintsevich, Kirill; Moretti, Giovanni; Mujadia, Vandan; Muzerelle, Judith; Nam, Sangha; Nøklestad, Anders; Ogrodniczuk, Maciej; Øvrelid, Lilja; Pamay Arslan, Tuğba; Passarotti, Marco; Poibeau, Thierry; Porada, Ian; Recasens, Marta; Seo, Sumin; Solberg, Per Erik; Stede, Manfred; Štěpánek, Jan; Štěpánková, Barbora; Straka, Milan; Swanson, Daniel; Toldova, Svetlana; Vadász, Noémi; van Cranenburgh, Andreas; Velldal, Erik; Vincze, Veronika; Zeldes, Amir; Žitkus, Voldemaras
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2026
Rights	Licence CorefUD v1.4; https://lindat.mff.cuni.cz/repository/static/license-corefud-1.4.html; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Greek, Ancient (to 1453); Catalan; Valencian; Czech; Dutch; Flemish; English; French; German; Hindi; Hungarian; Korean; Latin; Lithuanian; Norwegian; Church Slavic; Old Slavonic; Church Slavonic; Old Bulgarian; Old Church Slavonic; Polish; Russian; Spanish; Castilian; Turkish
Resource Type	corpus
Format	application/octet-stream; application/zip; downloadable_files_count: 1
Discipline	Linguistics