CLARIN - Repositories

Testimonies of Roma and Sinti

The key idea of our project is to convey to the widest possible readership detailed abstracts of the testimonies of Roma and Sinti and thus their personal and irreplaceable...

ORAL2006: Corpus of informal spoken Czech

A corpus of informal spoken Czech, 1 MW in size. It contains transcriptions of 221 recordings made in 2002–2006 across the whole of Bohemia. All the recordings were made in informal...

ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcri...

ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

Additional German-Czech reference translations of the WMT'11 test set

Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. Original segmentation...

Corpus from the Aozora Bunko Library

This corpus contains a subset of available texts from the Aozora Bunko public library project, which contains various works of mostly older literature in Japanese. A custom...

SLäNDa 2.0

SLäNDa, the Swedish literature corpus of narrative and dialogue, is a corpus made up of eight Swedish literary novels from the 19th and early 20th centuries, manually annotated...

Arabic Triliteral roots Lexicon

Description: This XML file is a lexicon containing all 21952 (28x28x28) Arabic triliteral combinations (roots). The file is split into three parts as follows: the first part...

A Gold Standard Word Alignment for English-Swedish (2015-10-12)

A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word-aligned sentence pairs from the English and Swedish versions of Europarl v. 2.

Corpus of contemporary blogs

At the NLP Centre, text is currently divided into sentences with a tool that uses a rule-based system. In order to make enough training data for machine learning, annotators...

Quality and Efficiency of Manual Annotation: Data from the Pre-annotation Bia...

Input data, individual experimental annotations, and a complete and detailed overview of the measured results related to the experiment described in the referenced paper.

Slovak Dependency Treebank

Slovak Dependency Treebank (Slovenský závislostný korpus) was created as part of the Slovak National Corpus at the Ľ. Štúr Institute of the Slovak Academy of Sciences. The...

LongEval Train Collection

The collection consists of queries and documents provided by the Qwant search Engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...

SynSemClass2.0

The SynSemClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language...

Czech Named Entity Corpus 1.0

The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a...

LatinISE corpus (version 4)

The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre,...

The Model latinpipe-evalatin24-240520 for LatinPipe 2024

The latinpipe-evalatin24-240520 is a PhilBerta-based model for LatinPipe 2024 https://github.com/ufal/evalatin2024-latinpipe, performing tagging, lemmatization, and dependency...

Test Data DE-EN APE Shared Task WMT17

Test data for the WMT 2017 Automatic post-editing task (the same used for the Sentence-level Quality Estimation task). They consist of German-English triplets (source and...

Italian Content Words

This resource is an Italian morphological dictionary for content words, encoded in a JSON Lines format text file. It contains correspondences between surface form and lexical...

ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio)

ORAL2013 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

FicTree 1.0

FicTree is a dependency treebank of Czech fiction manually annotated in the format of the analytical layer of the Prague Dependency Treebank. The treebank consists of 12,760...

1,494 datasets found