Dataset - B2FIND

C4Corpus (CC BY-NC part)

A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly...

Universal Dependencies 2.6

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Świgra

Implementation of Świdziński's formal grammar of Polish. Requires a parser (Birnam parser available as a separate tool) and a morphological analyser (no free analyser for...

IWPT 2021 Shared Task Data and System Outputs

This package contains data used in the IWPT 2021 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal...

Deep Universal Dependencies 2.4

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional...

Morpho-syntactically annotated corpora provided for the PARSEME Shared Task o...

This multilingual resource contains corpora for 14 languages, gathered at the occasion of the 1.2 edition of the PARSEME Shared Task on semi-supervised Identification of Verbal...

Coreference in Universal Dependencies 1.1 (CorefUD 1.1)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

Universal Dependencies 1.4

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Speecon databases

28 speech databases containing broadband recordings from 550 adults and 50 children per language. Contains interesting phonetically rich material. All orthographically...

Universal Dependencies 2.7

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Universal Dependencies 2.10 models for UDPipe 2 (2022-07-11)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Depenencies 2.10 Treebanks, created solely using UD 2.10 data...

Deltacorpus

Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...

CoNLL 2017 Shared Task System Outputs

This package contains the system outputs from the CoNLL 2017 Shared Task in Multilingual Parsing from Raw Text to Universal Dependencies.

Multilingual corpus of literal occurrences of multiword expressions

The corpus contains sentences with idiomatic, literal and coincidental occurrences of verbal multiword expressions (VMWEs) in Basque, German, Greek, Polish and Portuguese. The...

Universal Dependencies 2.5 Models for UDPipe (2019-12-06)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Depenencies 2.5 Treebanks, created solely using UD 2.5 data...

Universal Dependencies 1.3

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

C4Corpus (CC BY-ND part)

A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly...

Universal Dependencies 2.3

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Universal Dependencies 2.12

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

It-Sr-NER

It-Sr-NER tool is a CLARIN compatible NER web service for parallel texts with case study on Italian and Serbian; it can be used for recognizing and classifying named entities in...

653 datasets found