36 datasets found

Keywords: annotated corpus

  • INEL Kalmyk Corpus

    Corpus citation: Baranova, Vlada. 2025. INEL Kalmyk Corpus. Archived at Universität Hamburg. Version 1.0. Publication date...
  • German Twitter Titling Corpus

    The German Titling Twitter Corpus consists of 1904 stance-annotated tweets collected in June/July 2018 mentioning 24 German politicians with a doctoral degree. The Addendum...
  • WikiWarsDE Corpus

    The WikiWarsDE corpus is a German corpus containing Wikipedia articles with annotations of temporal expressions. Its creation was motivated by the English WikiWars corpus (Mazur...
  • STYX 1.0 (2017-10-03)

    STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank. The criterion for including sentences in STYX was their suitability for practicing Czech...
  • Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2019 – VERSION 1)

    (German version: see below.) The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...
  • Etalon 1.0

    Etalon is a manually annotated corpus of contemporary Czech. The corpus contains 1,885,589 words (2,265,722 tokens) and is annotated in the same way as SYN2020 of the Czech...
  • STYX 1.0

    STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank. The criterion for including sentences in STYX was their suitability for practicing Czech...
  • Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2013 – VERSION 1)

    (German version: see below.) The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...
  • ParCorFull: A Parallel Corpus Annotated with Full Coreference

    ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual...
  • Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2020 – VERSION 1)

    (German version: see below.) The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...
  • The Diorisis Ancient Greek Corpus

    An annotated corpus of literary Ancient Greek sourced from the Perseus Canonical Greek Lit repository (https://github.com/PerseusDL/canonical-greekLit), “The Little Sailing”...
  • Czech RST Discourse Treebank 1.0

    The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0) is a dataset of 54 Czech journalistic texts manually annotated using Rhetorical Structure Theory (RST). Each text...
  • KAMOKO: KAsseler MOrgenstern KOrpus

    KAMOKO is a structured and commented French learner corpus. It addresses the central structures of the French language from a linguistic perspective (18 different courses). The...
  • KAMOKO: KAsseler MOrgenstern KOrpus (2021-02-09)

    KAMOKO is a structured and commented French learner corpus. It addresses the central structures of the French language from a linguistic perspective (18 different courses). The...
  • Large-Scale Colloquial Persian 0.5

    The "Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a...
  • HetWiK: Heterogene Widerstandskulturen

    The representative, full-text digitized HetWiK corpus is composed of 140 manually annotated texts of the German Resistance between 1933 and 1945. This includes both well-known...
  • OpenLegalData (2022 - Corpus)

    OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of...
  • Szeged Corpus 2.0

    Written, monolingual, general, manually POS-annotated reference corpus; 1,459,288 tokens; MSD tagset; XML (TEI P4) files.
  • HamleDT 3.0

    HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that...
  • QTLeap WSD/NED corpus

    This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are Q&A interactions from the...
You can also access this registry using the API (see API Docs).