Dataset - B2FIND

L2 Acquisition P-Moll Norbert Dittmar

Language Acquisition corpus

Universal Dependencies 2.14

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Universal Dependencies 1.4

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Khresmoi Query Translation Test Data 2.0

This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish and...

Annotated corpora and tools of the PARSEME Shared Task on Automatic Identific...

The PARSEME shared task aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision),...

Universal Dependencies 2.8

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

CoNLL 2017 and 2018 Shared Task Blind and Preprocessed Test Data

CoNLL 2017 and 2018 shared tasks: Multilingual Parsing from Raw Text to Universal Dependencies. This package contains the test data in the form in which they were presented to...

SpeechDat-East databases

5 telephone databases recorded over the PSTN. Contains interesting phonetically rich material. All orthographically transcribed. Speaker information included for gender, age,...

Deep Universal Dependencies 2.5

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional...

TITUS Old Polish

ca. 5,000 tokens; linked with a relational database; XML encoding in progress

Universal Segmentations 1.0 (UniSegments 1.0)

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation...

OmegaWiki

This dataset has no description

Morfeusz

Morfeusz is a morphological analyser (not a stemmer, not a tagger) for Polish, without a guesser, so it's a morphological dictionary of a kind. Note it's a library, not a ready...

Universal Derivations v1.1

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

CUBBITT Translation Models (en-pl) (v1.0)

CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are...

Preamble 1.0

Preamble 1.0 is a multilingual annotated corpus of the preamble of the EU REGULATION 2020/2092 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. The corpus consists of four...

Universal Dependencies 2.4 Models for UDPipe (2019-05-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Dependencies 2.4 Treebanks, created solely using UD 2.4 data...

CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)

The corpipe23-corefud1.2-240906 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released...

C4Corpus (CC BY-SA part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, that has been extracted from CommonCrawl, the largest publicly...

National Corpus of Polish

In (advanced) preparation: a reference corpus of the Polish language containing hundreds of millions of words.

749 datasets found