-
QT21 Data
Post-editing and MQM annotations produced by the QT21 project. As described in @InProceedings{specia-etal_MTSummit:2017, author = {Specia, Lucia and Kim Harris and... -
KER - Keyword Extractor
KER is a keyword extractor that was designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm with the idf tables trained on texts from... -
POPPINS
Document classifier -
Czech Lexico-Semantic Database 0.1
A lexicographical project, whose aim is to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) in order to create an integrated digital... -
IFA speech corpus
Spoken corpus containing speech of 4 male and 4 female speakers. 50,000 words segmented at phoneme level -
Korpus 90
written, general language; 22 million tokens -
The National Certificates corpus
The NC test results, background information, and speaking and writing performances in 9 foreign/second languages. A web-based database (HTML files). -
Hunglish Corpus
Bilingual; written, general language; 2 million sentences -
LongEval Click-Model Relevance Judgements (Qrels)
The collection comprises the relevance judgments used in the 2023 LongEval Information Retrieval Lab (https://clef-longeval.github.io/), organized at CLEF. It consists of three... -
Terminal-based CoNLL-file viewer, v2
A simple way of browsing CoNLL format files in your terminal. Fast and text-based. To open a CoNLL file, simply run: ./view_conll sample.conll The output is piped through less,... -
Universal Dependencies 1.3
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual... -
Individual Textual Profiles of Hillary Clinton and Donald Trump
This corpus consists of full transcriptions of both Democratic and Republican 2016 presidential candidate debates, with a special focus on the idiolects of Hillary Clinton and... -
Project Gutenberg
Free electronic books available to download or browse online; German-language literature makes up only a... -
LAC Nqeq Corpus
Language and Cognition corpus -
Plant names in Dutch dialect (PLAND)
Plant names in Dutch dialect -
Gesprächanalytisches Informationssystem (GAIS)
Web-based information system for the scientific community (news, events, persons, job market, mailing list, database on research projects and corpora, bibliography, glossary and... -
Test Data EN-DE APE Shared Task WMT17
Test data for the WMT 2017 Automatic post-editing task (the same used for the Sentence-level Quality Estimation task). They consist of 2,000 English-German pairs (source and... -
TITUS Laz
ca. 900 tokens -
MEd
MEd is an annotation tool in which linearly-structured annotations of text or audio data can be created and edited. The tool supports multiple stacked layers of annotations that... -
TITUS Prakrit
ca. 7,000 tokens; linked with relational database; XML encoding in progress
