Frequency lists of collocations from the Gigafida 2.1 corpus

Dataset

Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/ske/#dashboard?corpname=gfida21) using specialised scripts for extraction of data from syntactically parsed corpora.

The lists contain collocations with an absolute frequency of 10 or higher, split into files corresponding to 81 predefined syntactic structures. A formal description of the syntactic structures, including the restrictions and representations applied to the POS and dependency parsing annotations, is included in the dataset.

The lists are sorted by the absolute frequency of the collocations and include frequency information for the individual lemmas, together with the most frequent representative forms of the combined lemmas. They also include the logDice score for each collocation and the number of distinct lemma forms appearing in the corpus hits for that collocation.
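
For readers unfamiliar with logDice, the following is a minimal sketch of how the score is commonly computed from the frequency of a collocation and the frequencies of its two lemmas. It assumes the standard definition, logDice = 14 + log2(2 * f_xy / (f_x + f_y)); the dataset's own extraction scripts are not reproduced here and may differ in detail. The function name `log_dice` and the example frequencies are hypothetical and are not taken from the lists.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return the logDice association score under the standard definition.

    f_xy: absolute frequency of the collocation (both lemmas together)
    f_x:  absolute frequency of the first lemma
    f_y:  absolute frequency of the second lemma
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    # Dice coefficient scaled to a log2 range with a theoretical maximum of 14.
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical frequencies, chosen only for illustration.
score = log_dice(f_xy=50, f_x=1000, f_y=600)
print(f"{score:.2f}")
```

Because the score depends only on the relative frequencies of the collocation and its lemmas, not on corpus size, values computed this way are comparable across corpora of different sizes.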

Identifier
PID	http://hdl.handle.net/11356/1415
Related Identifier	http://slovnica.ijs.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1415

Provenance
Creator	Krek, Simon; Gantar, Polona; Kosem, Iztok; Dobrovoljc, Kaja; Arhar Holdt, Špela; Čibej, Jaka; Laskowski, Cyprian; Klemenc, Bojan; Krsnik, Luka
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2021
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB; https://creativecommons.org/licenses/by-sa/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics