-
L2 Acquisition P-Moll Norbert Dittmar
Language Acquisition corpus -
Universal Dependencies 2.14
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual... -
Universal Dependencies 1.4
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual... -
Khresmoi Query Translation Test Data 2.0
This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish ans... -
Annotated corpora and tools of the PARSEME Shared Task on Automatic Identific...
The PARSEME shared task aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision),... -
Universal Dependencies 2.8
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual... -
CoNLL 2017 and 2018 Shared Task Blind and Preprocessed Test Data
CoNLL 2017 and 2018 shared tasks: Multilingual Parsing from Raw Text to Universal Dependencies. This package contains the test data in the form in which they were presented to... -
SpeechDat-East databases
5 telephone databases recorded over the PSTN. Contains interesting phonetically rich material. All orthographically transcribed. Speaker information included for gender, age,... -
Deep Universal Dependencies 2.5
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional... -
TITUS Old Polish
ca. 5,000 tokens; linked with relational database; XML-encoding in progress -
Universal Segmentations 1.0 (UniSegments 1.0)
Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation... -
OmegaWiki
This dataset has no description
-
Morfeusz
Morfeusz is a morphological analyser (not a stemmer, not a tagger) for Polish, without a guesser - so it's a morphological dictionary of a kind. Note it's a library, not a ready... -
Universal Derivations v1.1
Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent... -
CUBBITT Translation Models (en-pl) (v1.0)
CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are... -
Preamble 1.0
Preamble 1.0 is a multilingual annotated corpus of the preamble of the EU REGULATION 2020/2092 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. The corpus consists of four... -
Universal Dependencies 2.4 Models for UDPipe (2019-05-31)
Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Dependencies 2.4 Treebanks, created solely using UD 2.4 data... -
CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)
The corpipe23-corefud1.2-240906 is a mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 https://github.com/ufal/crac2023-corpipe. It is released... -
C4Corpus (CC BY-SA part)
A large web corpus (over 10 billion tokens) licensed under the Creative Commons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly... -
National Corpus of Polish
In (advanced) preparation: a reference corpus of the Polish language containing hundreds of millions of words.
