CLARIN - Repositories

QT21 Data

Post-editing and MQM annotations produced by the QT21 project. As described in @InProceedings{specia-etal_MTSummit:2017, author = {Specia, Lucia and Kim Harris and...

KER - Keyword Extractor

KER is a keyword extractor that was designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm with the idf tables trained on texts from...

POPPINS

Document classifier

Czech Lexico-Semantic Database 0.1

A lexicographical project, whose aim is to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) in order to create an integrated digital...

IFA speech corpus

Spoken corpus containing speech of 4 male and 4 female speakers; 50,000 words segmented at the phoneme level.

Korpus 90

written, general language; 22 million tokens

The National Certificates corpus

NC test results, background information, and speaking and writing performances in 9 foreign/second languages. A web-based database (HTML files).

Hunglish Corpus

Bilingual, written, general language; 2 million sentences

LongEval Click-Model Relevance Judgements (Qrels)

The collection comprises the relevance judgments used in the 2023 LongEval Information Retrieval Lab (https://clef-longeval.github.io/), organized at CLEF. It consists of three...

Terminal-based CoNLL-file viewer, v2

A simple way of browsing CoNLL format files in your terminal. Fast and text-based. To open a CoNLL file, simply run: ./view_conll sample.conll The output is piped through less,...

Universal Dependencies 1.3

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Individual Textual Profiles of Hillary Clinton and Donald Trump

This corpus consists of full transcriptions of both Democratic and Republican 2016 presidential candidate debates, with a special focus on the idiolects of Hillary Clinton and...

Project Gutenberg

Free electronic books available for download or online browsing; German-language literature makes up only a...

LAC Nqeq Corpus

Language and Cognition corpus

Plant names in Dutch dialect (PLAND)

Plant names in Dutch dialect

Gesprächanalytisches Informationssystem (GAIS)

Web-based information system on the scientific community (news, events, persons, job market, mailing list, database on research projects and corpora, bibliography, glossary and...

Test Data EN-DE APE Shared Task WMT17

Test data for the WMT 2017 Automatic Post-Editing task (the same data used for the Sentence-level Quality Estimation task). They consist of 2,000 English-German pairs (source and...

TITUS Laz

ca. 900 tokens

MEd

MEd is an annotation tool in which linearly-structured annotations of text or audio data can be created and edited. The tool supports multiple stacked layers of annotations that...

TITUS Prakrit

ca. 7,000 tokens; linked with relational database; XML encoding in progress
