-
Croatian linguistic training corpus hr500k 2.0
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and... -
Terminological multiword expressions lexicon
The Terminological Multiword Expressions Lexicon contains multiword terms extracted from various terminological sources. The entries were lemmatized and tagged according to the... -
Automatically constructed multiword lexicon srMWELex v0.5
The srMWELex lexicon is an automatically constructed lexicon of Serbian multiword expression candidates (mostly collocations) from the parsed srWaC 1.0 corpus by using the... -
Automatically constructed multiword lexicon hrMWELex v0.5
The hrMWELex lexicon is an automatically constructed lexicon of Croatian multiword expression candidates (mostly collocations) from the parsed hrWaC 2.0 corpus by using the... -
Automatically constructed multiword lexicon slMWELex v0.5
The slMWELex lexicon is an automatically constructed lexicon of Slovene multiword expression candidates (mostly collocations) from the parsed KRES corpus by using the DepMWEx... -
KRES corpus n-grams 1.0
This is a collection of n-grams extracted from the KRES corpus of written Slovene. In addition to the separate lists of n-grams for tokens and their attributes (morphosyntactic... -
Gos corpus n-grams 1.0
This is a collection of n-grams extracted from the Gos corpus of spoken Slovene (http://hdl.handle.net/11356/1040). In addition to the separate lists of n-grams for tokens and... -
List of formulaic sequences in standard written Slovenian
This document contains 1,891 formulaic sequences in standard written Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic... -
Janes corpus n-grams 1.0
A collection of n-grams extracted from the Janes corpus of Slovenian user-generated content version 1.0 (cf. http://nl.ijs.si/janes/). Three sets of n-gram lists are provided... -
Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus
The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialized scripts... -
Gos corpus n-grams 2.0
A collection of n-grams extracted from the Gos corpus of spoken Slovene (cf. http://eng.slovenscina.eu/korpusi/gos). Three sets of n-gram lists are provided for lowercased word... -
List of formulaic sequences in spoken Slovenian
This document contains 2,374 formulaic sequences in spoken Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic structure,... -
Kres corpus n-grams 2.0
A collection of n-grams extracted from the Kres corpus of written Slovene (cf. http://eng.slovenscina.eu/korpusi/kres). Three sets of n-gram lists are provided for lowercased... -
IMP corpus n-grams 1.0
This is a collection of n-grams extracted from the IMP corpus of historical Slovene (http://hdl.handle.net/11356/1031). In addition to the separate lists of n-grams for tokens... -
IMP corpus n-grams 2.0
A collection of n-grams extracted from the IMP corpus of historical Slovene (cf. http://nl.ijs.si/imp/). Three sets of n-gram lists are provided for lowercased word n-grams of... -
Dataset of Slovene idiomatic expressions SloIE
SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences with 75 different expressions that can occur with either a literal or an... -
The Database of Lithuanian multiword expressions
The Database of Lithuanian multiword expressions (MWEs) has been freely accessible for online search since 2019 at: https://resursai.pastovu.vdu.lt/paieska/paprastoji. It contains... -
Database of Lithuanian Multiword Expressions
The Database of Lithuanian multiword expressions (MWEs) contains bi-gram and tri-gram MWEs that occurred in the DELFI.lt corpus (http://tekstynas.mwe.lt/) at least 10 times. In the... -
Colloc -- A Tool for Automatic Identification of Multiword Expressions
Colloc -- a tool for automatic identification of multiword expressions (MWE) is freely available for online use at http://resursai.mwe.lt/atpazintuvas. As material for training... -
Annotated corpora and tools of the PARSEME Shared Task on Semi-Supervised Ide...
This multilingual resource contains corpora in which verbal MWEs have been manually annotated, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on...