Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus

Dataset

The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialized scripts for extracting data from corpora containing syntactic dependency annotations. The lexicon contains 5,242 multiword expressions (MWEs) with 12,358 examples from Gigafida 2.1. Each MWE entry (or sense) contains at least one and up to three extracted examples.

MWEs were analysed using the JOS dependency parser system (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf) and were assigned matching syntactic structure IDs. Sentences containing the MWE components and the matching syntactic structure features were then identified in the corpus and assigned to the corresponding headword or sense.

MWE variants (or variant senses) are linked with the "senseKey" attribute values, forming an MWE cluster of related variants or variant senses. A sample of MWE headwords also contains manually created sense divisions with a description of meaning for each sense.
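
As an illustration of the clustering described above, the following minimal sketch groups entries by their "senseKey" value. It assumes that related variants share the same "senseKey" value, which is one possible reading of the description. The function name `cluster_by_sense_key`, the dictionary layout, and the placeholder entries are all hypothetical; the actual layout of the distributed file is not described in this record.

```python
from collections import defaultdict


def cluster_by_sense_key(entries):
    """Group MWE strings by the value of their (assumed shared) senseKey."""
    clusters = defaultdict(list)
    for entry in entries:
        # Assumption: every entry carries a "senseKey" and an "mwe" field.
        clusters[entry["senseKey"]].append(entry["mwe"])
    return dict(clusters)


# Placeholder entries for illustration only, not taken from the lexicon.
sample_entries = [
    {"mwe": "variant A1", "senseKey": "key-A"},
    {"mwe": "variant A2", "senseKey": "key-A"},
    {"mwe": "variant B1", "senseKey": "key-B"},
]

clusters = cluster_by_sense_key(sample_entries)
for sense_key, variants in clusters.items():
    print(sense_key, variants)
```

If the real data links variants differently (for example, by pointing from a variant to the ID of another entry), the grouping step would need to follow those references instead.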

Identifier
PID	http://hdl.handle.net/11356/1421
Related Identifier	http://slovnica.ijs.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1421
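
The Metadata Access URL above is an OAI-PMH GetRecord request that returns the record in Dublin Core (oai_dc). The sketch below shows how such a URL could be assembled and how the Dublin Core fields could be read from a response. The helper names `build_get_record_url` and `dublin_core_fields` are hypothetical, and `SAMPLE_RESPONSE` is a hand-written, abbreviated stand-in, not the actual output of the repository.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://www.clarin.si/repository/oai/request"
DC_NS = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def build_get_record_url(identifier, metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord request URL for one record."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown above.
    return BASE_URL + "?" + urlencode(params, safe=":/")


def dublin_core_fields(xml_text):
    """Collect Dublin Core element values from an OAI-PMH response."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(DC_NS) and element.text and element.text.strip():
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


# Hand-written, abbreviated example of a response (assumption, not real output).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus</dc:title>
      <dc:date>2021</dc:date>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>
"""

url = build_get_record_url("oai:www.clarin.si:11356/1421")
fields = dublin_core_fields(SAMPLE_RESPONSE)
print(url)
print(fields)
```

To query the live repository, the text of the response could be obtained with `urllib.request.urlopen(url)` and passed to `dublin_core_fields`; the network call is left out here so that the sketch runs offline.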

Provenance
Creator	Krek, Simon; Gantar, Apolonija; Laskowski, Cyprian; Krsnik, Luka; Kosem, Iztok; Brank, Janez; Dobrovoljc, Kaja; Arhar Holdt, Špela; Čibej, Jaka; Robnik-Šikonja, Marko; Klemenc, Bojan; Gorjanc, Vojko
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2021
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB; https://creativecommons.org/licenses/by-sa/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics