Dataset - B2FIND

Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus

The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialized scripts for...

Deep Sequoia corpus - PARSEME-FR corpus - FrSemCor

The Sequoia corpus is a set of 3,099 linguistically annotated French sentences, originating from four sources (Europarl, European Agency Reports, French regional journal L'Est...

ParaDi 2.0

ParaDi 2.0 is a dictionary of single verb paraphrases of Czech verbal multiword expressions - light verb constructions and idiomatic verb constructions. Moreover, it provides...

Multiword expressions in the Prague Dependency Treebank 2.0

This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as...

Czech Verbal MWEs

Lexicon of Czech verbal multiword expressions (VMWEs) used in Parseme Shared Task 2017....

Prague Dependency Treebank 2.5

The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see...

Czech Multiword Expressions

The dataset contains 4731 frozen continuous Czech multiword expressions. Inflectional word forms are generated for those MWEs where applicable. In total, the dataset contains...

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...

Annotated corpora and tools of the PARSEME Shared Task on Semi-Supervised Ide...

This multilingual resource contains corpora in which verbal MWEs have been manually annotated, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on...

PARSEME corpora annotated for verbal multiword expressions (version 1.3)

This multilingual resource contains corpora in which verbal MWEs have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C) is a consolidated...

Gold Standard Reference Data for Multiword Expression Extraction: Czech Depen...

Annotated list of dependency bigrams occurring in the PDT more than five times and having part-of-speech patterns that can possibly form a collocation. Each bigram is assigned...

ParaDi 2.0 (2018-01-24)

ParaDi 2.0 is a dictionary of single verb paraphrases of Czech verbal multiword expressions - light verb constructions and idiomatic verb constructions. Moreover, it provides...

Prague Dependency Treebank 3.5

The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied...

Croatian linguistic training corpus hr500k 2.0

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...

Dependency tree extraction tool STARK 3.0

STARK is a highly customizable tool designed for extracting different types of syntactic structures (trees) from parsed corpora (treebanks), aimed at corpus-driven linguistic...

IMP corpus n-grams 2.0

A collection of n-grams extracted from the IMP corpus of historical Slovene (cf. https://nl.ijs.si/imp/). Three sets of n-gram lists are provided for lowercased word n-grams of...

KRES corpus n-grams 1.0

This is a collection of n-grams extracted from the KRES corpus of written Slovene. In addition to the separate lists of n-grams for tokens and their attributes (morphosyntactic...

IMP corpus n-grams 1.0

This is a collection of n-grams extracted from the IMP corpus of historical Slovene (http://hdl.handle.net/11356/1031). In addition to the separate lists of n-grams for tokens...

Automatically constructed multiword lexicon hrMWELex v0.5

The hrMWELex lexicon is an automatically constructed lexicon of Croatian multiword expression candidates (mostly collocations) from the parsed hrWaC 2.0 corpus by using the...

33 datasets found