Dataset - B2FIND

INEL Kalmyk Corpus

Corpus citation Baranova, Vlada. 2025. INEL Kalmyk Corpus. Archived at Universität Hamburg. Version 1.0. Publication date...

German Twitter Titling Corpus

The German Titling Twitter Corpus consists of 1904 stance-annotated tweets collected in June/July 2018 mentioning 24 German politicians with a doctoral degree. The Addendum...

WikiWarsDE Corpus

The WikiWarsDE corpus is a German corpus containing Wikipedia articles with annotations of temporal expressions. Its creation was motivated by the English WikiWars corpus (Mazur...

STYX 1.0 (2017-10-03)

STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency treebank. The criterion for including sentences into STYX was their suitability for practicing Czech...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2019 – VERSION 1)

german version see below The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Etalon 1.0

Etalon is a manually annotated corpus of contemporary Czech. The corpus contains 1,885,589 words (2,265,722 tokens) and is annotated in the same way as SYN2020 of the Czech...

STYX 1.0

STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency treebank. The criterion for including sentences into STYX was their suitability for practicing Czech...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2013 – VERSION 1)

german version see below The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

ParCorFull: A Parallel Corpus Annotated with Full Coreference

ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2020 – VERSION 1)

german version see below The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

The Diorisis Ancient Greek Corpus

An annotated corpus of literary Ancient Greek sourced from the Perseus Canonical Greek Lit repository (https://github.com/PerseusDL/canonical-greekLit), “The Little Sailing”...

Czech RST Discourse Treebank 1.0

The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0) is a dataset of 54 Czech journalistic texts manually annotated using the Rhetorical Structure Theory (RST). Each text...

KAMOKO: KAsseler MOrgenstern KOrpus

KAMOKO is a structured and commented french learner-corpus. It addresses the central structures of the French language from a linguistic perspective (18 different courses). The...

KAMOKO: KAsseler MOrgenstern KOrpus (2021-02-09)

KAMOKO is a structured and commented french learner-corpus. It addresses the central structures of the French language from a linguistic perspective (18 different courses). The...

Large-Scale Colloquial Persian 0.5

"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in asemantic taxonomy that focuses on multi-task informal Persian language understanding as a...

HetWiK: Heterogene Widerstandskulturen

The representative full-text digitalized HetWiK corpus is composed of 140 manually annotated texts of the German Resistance between 1933 and 1945. This includes both well-known...

OpenLegalData (2022 - Corpus)

OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of...

Szeged Corpus 2.0

written, monolingual, general, manually POS annotated reference corpus; 1,459,288 tokens; MSD tagset, XML (TEI P4) files

HamleDT 3.0

HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that...

QTLeap WSD/NED corpus

This corpora is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are Q&A interactions from the...

36 datasets found