Dataset - B2FIND

Universal Dependencies 2.7

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

LongEval Test Collection

The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...

Manually Classified Errors in Cs->Sk Translation

Manual classification of errors in Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. The first 50 sentences from the WMT 2010 test set were...

Universal Dependencies 2.6

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Multilingual corpus of literal occurrences of multiword expressions

The corpus contains sentences with idiomatic, literal and coincidental occurrences of verbal multiword expressions (VMWEs) in Basque, German, Greek, Polish and Portuguese. The...

C4Corpus (publicdomain part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly...

esCorpius: A Massive Spanish Crawling Corpus

In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of...

MORFO

The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for...

Universal Dependencies 2.14

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

W2C – Web to Corpus – tool

A tool used to build multilingual corpora from Wikipedia. It downloads the web pages, converts them to plain text, identifies the language, etc. A set of 120 corpora collected using this...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2019 – VERSION 1)

(For the German version, see below.) The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

WMT18 Quality Estimation Shared Task Training and Development Data

Training and development data for the WMT18 QE task. Test data will be published as a separate item. This shared task will build on its previous six editions to further examine...

WMT16 Quality Estimation Shared Task Training and Development Data

Training and development data for the WMT16 QE task. Test data will be published as a separate item. This shared task will build on its previous four editions to further examine...

Universal Dependencies 1.4

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Etalon 1.0

Etalon is a manually annotated corpus of contemporary Czech. The corpus contains 1,885,589 words (2,265,722 tokens) and is annotated in the same way as SYN2020 of the Czech...

ORAL2008: Balanced corpus of informal spoken Czech

A balanced corpus of informal spoken Czech, 1 million words (1 MW) in size. It contains transcriptions of 297 recordings made in 2002–2007 in the whole of Bohemia. All the recordings were made in...

Annotated corpora and tools of the PARSEME Shared Task on Automatic Identific...

The PARSEME shared task aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision),...

VIADAT-REPO+DEPOSIT

VIADAT-REPO is an additional module to the lindat-dspace platform which allows for depositing data records in the field of oral history, including its specific metadata...

GrandStaff-LMX: Linearized MusicXML Encoding of the GrandStaff Dataset

The GrandStaff-LMX dataset is based on the GrandStaff dataset described in the "End-to-end optical music recognition for pianoform sheet music" paper by Antonio Ríos-Vila et...

Universal Dependencies 2.8

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

500 datasets found