Dataset - B2FIND

C4Corpus (publicdomain part)

A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly...

Universal Dependencies 1.2

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

C4Corpus (CC-BY part)

A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly...

Universal Dependencies 2.5

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

C4Corpus (CC BY-SA part)

A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly...

Universal Dependencies 2.12 models for UDPipe 2 (2023-07-17)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Depenencies 2.12 Treebanks, created solely using UD 2.12 data...

Coreference in Universal Dependencies 0.2 (CorefUD 0.2)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

IWPT 2020 Shared Task Data and System Outputs

This package contains data used in the IWPT 2020 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal...

Plaintext Wikipedia dump 2018

Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at...

Universal Dependencies 2.8.1

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

The IPI PAN Corpus

written, general, monolingual, synchronic; 250 million; XML (XCES), morphosyntactic, structural, metada

Multilingual static embeddings for Verbal Multiword Expressions trained on PA...

This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi,...

TITUS Old Polish

ca. 5.000 tokens; linked with relational database; XML-encoding in progress

SpeechDat-East databases

5 telephone databases recorded over the PSTN. Contains interesting phonetically rich material. All orthographically transcribed. Speaker information included for gender, age,...

Deep Universal Dependencies 2.7

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3424). It contains additional...

JRC-Acquis

The largest parallel corpus, contains EU law, the Acquis Communautaire in 22 languages.

Universal Dependencies 2.4

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

CoNLL 2018 Shared Task System Outputs

Test data parsed by systems submitted to the CoNLL 2018 UD parsing shared task.

Universal Dependencies 2.8

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Universal Dependencies 2.2

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

653 datasets found