-
INEL Kalmyk Corpus
Corpus citation: Baranova, Vlada. 2025. INEL Kalmyk Corpus. Archived at Universität Hamburg. Version 1.0. Publication date... -
German Twitter Titling Corpus
The German Titling Twitter Corpus consists of 1904 stance-annotated tweets collected in June/July 2018 mentioning 24 German politicians with a doctoral degree. The Addendum... -
WikiWarsDE Corpus
The WikiWarsDE corpus is a German corpus containing Wikipedia articles with annotations of temporal expressions. Its creation was motivated by the English WikiWars corpus (Mazur... -
STYX 1.0 (2017-10-03)
STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank. The criterion for including sentences in STYX was their suitability for practicing Czech... -
Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2019 – VERSION 1)
German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the... -
Etalon 1.0
Etalon is a manually annotated corpus of contemporary Czech. The corpus contains 1,885,589 words (2,265,722 tokens) and is annotated in the same way as SYN2020 of the Czech... -
STYX 1.0
STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank. The criterion for including sentences in STYX was their suitability for practicing Czech... -
Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2013 – VERSION 1)
German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the... -
ParCorFull: A Parallel Corpus Annotated with Full Coreference
ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual... -
Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2020 – VERSION 1)
German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the... -
The Diorisis Ancient Greek Corpus
An annotated corpus of literary Ancient Greek sourced from the Perseus Canonical Greek Lit repository (https://github.com/PerseusDL/canonical-greekLit), “The Little Sailing”... -
Czech RST Discourse Treebank 1.0
The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0) is a dataset of 54 Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST). Each text... -
KAMOKO: KAsseler MOrgenstern KOrpus
KAMOKO is a structured and commented French learner corpus. It addresses the central structures of the French language from a linguistic perspective (18 different courses). The... -
KAMOKO: KAsseler MOrgenstern KOrpus (2021-02-09)
KAMOKO is a structured and commented French learner corpus. It addresses the central structures of the French language from a linguistic perspective (18 different courses). The... -
Large-Scale Colloquial Persian 0.5
"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a... -
HetWiK: Heterogene Widerstandskulturen
The representative full-text digitalized HetWiK corpus is composed of 140 manually annotated texts of the German Resistance between 1933 and 1945. This includes both well-known... -
OpenLegalData (2022 - Corpus)
OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of... -
Szeged Corpus 2.0
written, monolingual, general, manually POS annotated reference corpus; 1,459,288 tokens; MSD tagset, XML (TEI P4) files -
HamleDT 3.0
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that... -
QTLeap WSD/NED corpus
This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are Q&A interactions from the...