CLARIN - Repositories

Individual Textual Profiles of Hillary Clinton and Donald Trump

This corpus consists of full transcriptions of both Democratic and Republican 2016 presidential candidate debates, with a special focus on the idiolects of Hillary Clinton and...

Test Data EN-DE APE Shared Task WMT17

Test data for the WMT 2017 Automatic Post-Editing task (the same data used for the Sentence-level Quality Estimation task). They consist of 2,000 English-German pairs (source and...

MEd

MEd is an annotation tool in which linearly-structured annotations of text or audio data can be created and edited. The tool supports multiple stacked layers of annotations that...

Semantic Features and Their Role In Conceptual Representation In School Age C...

Language acquisition is currently one of the most discussed topics in psycholinguistics. Considerable space for future research can be seen in the development of...

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from publicly available websites for NLP research involving Tamil. The standard set...

Prague Dependency Treebank 2.0 - sample data

A small subset of PDT 2.0 made available under a permissive license. Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech texts with complex and interlinked...

STYX 1.0 (2017-10-03)

STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank. The criterion for including sentences in STYX was their suitability for practicing Czech...

Amharic WIC Corpus

A substantially cleaned version of the existing morphologically annotated WIC Corpus.

Czech Models (MorfFlex CZ + PDT) for MorphoDiTa

Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ and...

LongEval Test Collection

The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...

Coreference in Universal Dependencies 1.3 (CorefUD 1.3)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

Manually Classified Errors in Cs->Sk Translation

Manual classification of errors in Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. The first 50 sentences from the WMT 2010 test set were...

PMLTQ::Web

A simple web application built on top of the PML Tree Query service.

Digital Humanities Courses at Czech Colleges 2017/2018

Titles of courses possibly relevant to the Digital Humanities for 2017-2018, manually gathered from course catalogues of most Czech state colleges, including the names of the...

LiFR-Law. Corpus of Paraphrased Czech Administrative Texts with Reading Compr...

LiFR-Law is a corpus of Czech legal and administrative texts with measured reading comprehension and a subjective expert annotation of diverse textual properties based on the...

Visible Vowels (2017-05-29)

This program enables the user to visualize f0 contours, to plot vowels in the F1/F2 space for multiple points in the vowel interval, e.g. at 20%, 50% and 80%, and to visualize...

Malach Center User Interface 1.0

Source code of the first complete, running version of the Malach Center User Interface. It does not contain data or metadata of the digital objects and resources.

NameTag 3 Multilingual CoNLL Model

This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained jointly on several NE corpora: English CoNLL-2003,...

Image Annotation Tool

Image Annotation Tool is a web application that allows users to mark zones of interest in an image. These zones are then converted to a TEI P5 code snippet that can be used in...

Artificial Treebank with Ellipsis

An artificially created treebank of elliptical constructions (gapping), in the annotation style of Universal Dependencies. The data are taken from the UD 2.1 release and from large web...

1,492 datasets found