CLARIN - Repositories

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2017 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Corpus Tècnic de l'IULA

Domain-specific corpus (Law, Economy, Computing, Medicine and Environment, as well as a contrastive corpus from the press); EN 3.3 M tokens, SP 33 M tokens, CAT 19 M tokens;...

Morphological Analyzer for Shipibo-Konibo

This tool is the first morphological analyzer ever for this language. The analyzer is an FST that produces all possible segmentations and tagging sequences in a word-by-word...

UPUS Corpus

Video-taped interviews and peer conversations from approximately 55 adolescents living in multilingual and multicultural communities in Oslo.

jusText

jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool has been implemented in Python, licensed under New BSD...

ROMLEX

Lexical database covering 25 Romani dialects with translations into English and, for some dialects, other European languages.

TITUS Buddhist Sanskrit

ca. 200,000 tokens; linked with relational database; XML-encoding in progress

Totoli corpus

Documentation of the Totoli project (DoBeS project)

PDT-Vallex: Czech Valency lexicon linked to treebanks 4.0 (PDT-Vallex 4.0)

The valency lexicon PDT-Vallex 4.0 has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague...

Saxophone Trills Dataset

This is the audio data of saxophone trills, used for difficulty estimation in the paper "Modeling the difficulty of saxophone music" by Šimon Libřický and Jan Hajič jr., ISMIR...

Concise dictionary of Latvian

25,000 entries

Corpus of Old Literary Finnish

This is a linguistically unannotated corpus of various historical texts written between 1543 and 1809. The corpus consists of 3,428,618 words and is available for online browsing.

SynSemClass3.0

The SynSemClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2021 – VERSION 1)

German version: see below. The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Logos : multilingual e-translation portal

Searchable multilingual text collection (700+ mwd) and a dictionary database of 251 languages and dialects. The Dictionary (ca. 8 mwd) provides translation of a word,...

Lefff 2.0

100,000 entries, text

documentArchiv.de / Historische Dokumenten- und Quellensammlung zur deutschen...

Documents on German history (e.g. German Empire; Weimar Republic; National Socialism; Federal Republic of Germany; German Democratic Republic); Dokumente zur deutschen...

TaKIPI

Morphosyntactic tagger working on the IPI PAN Corpus tagset.

GerManC : A representative historical corpus of German 1650-1800

The ultimate aim of the project is to compile a representative historical corpus of written German for the years 1650-1800. The complete GerManC corpus will contain 2000 word...

Prague Discourse Treebank 3.0

The Prague Discourse Treebank 3.0 (PDiT 3.0) is a new version of annotation of discourse relations marked by primary and secondary discourse connectives in the data of the...

4,938 datasets found