Dataset - B2FIND

L1 & L2 Acquisition Marzena Watorek French Project

Language Acquisition corpus

DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains...

Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data...

W2C – Web to Corpus – Corpora

A set of corpora for 120 languages automatically collected from Wikipedia and the web. Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1

Annotated corpora and tools of the PARSEME Shared Task on Semi-Supervised Ide...

This multilingual resource contains corpora in which verbal MWEs have been manually annotated, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on...

Extended CLEF eHealth 2013-2015 IR Test Collection

This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013-2015. Compared to the original version, it...

National Corpus of Polish

In (advanced) preparation: a reference corpus of the Polish language containing hundreds of millions of words.

Universal Derivations v0.5

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

C4Corpus (CC BY-NC-ND part)

A large web corpus (over 10 billion tokens) licensed under the Creative Commons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly...

Universal Dependencies 2.1

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

UDify Pretrained Model

Pretrained model weights for the UDify model, and extracted BERT weights in pytorch-transformers format. Note that these weights slightly differ from those used in the paper.

CorPipe 24 Multilingual CorefUD 1.2 Model (corpipe24-corefud1.2-240906)

The corpipe24-corefud1.2-240906 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 24 (https://github.com/ufal/crac2024-corpipe). It is...

Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

CorPipe 23 multilingual CorefUD 1.1 model (corpipe23-corefud1.1-231206)

The corpipe23-corefud1.1-231206 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is...

Deep Universal Dependencies 2.5

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional...

CUBBITT Translation Models (en-pl) (v1.0)

CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are...

Universal Dependencies 2.15

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Deep Universal Dependencies 2.6

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3226). It contains additional...

PAWS

PAWS is a multilingual parallel treebank with coreference annotation. It consists of English texts from the Wall Street Journal translated into Czech, Russian and Polish. In...

PARSEME corpora annotated for verbal multiword expressions (version 1.3)

This multilingual resource contains corpora in which verbal MWEs have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make...

653 datasets found