Frequency lists of collocations from the Gigafida 2.1 corpus


Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene using specialised scripts for extracting data from syntactically parsed corpora.

The lists contain collocations with an absolute frequency of 10 or more, split into files corresponding to 81 predefined syntactic structures. A formal description of the syntactic structures, including the restrictions and representations applied to the POS and dependency parsing annotations, is included in the dataset.

The lists are sorted by the absolute frequency of the collocations and include frequency information for the individual lemmas, together with the most frequent representative forms of the combined lemmas. They also include the logDice score for each collocation and the number of distinct forms of the lemmas appearing in the corpus hits for that collocation.
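
For readers unfamiliar with logDice, the sketch below shows how the score is commonly computed for a two-lemma collocation from the collocation frequency and the two lemma frequencies. It assumes the standard definition, logDice = 14 + log2(2 * f_xy / (f_x + f_y)); the record does not state the exact formula used by the extraction scripts, and structures combining more than two lemmas may be scored differently. The function name and the counts are hypothetical.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Standard logDice for a two-word collocation (assumed definition).

    f_xy: absolute frequency of the collocation
    f_x, f_y: absolute frequencies of the two individual lemmas
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts, not taken from the dataset.
score = log_dice(f_xy=50, f_x=100, f_y=100)
print(score)  # 13.0
```

The theoretical maximum is 14, reached when the two lemmas occur only together; because the score depends only on these three frequencies, values stay comparable across corpora of different sizes.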

Creator Krek, Simon; Gantar, Polona; Kosem, Iztok; Dobrovoljc, Kaja; Arhar Holdt, Špela; Čibej, Jaka; Laskowski, Cyprian; Klemenc, Bojan; Krsnik, Luka
Publisher Centre for Language Resources and Technologies, University of Ljubljana
Publication Year 2021
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB
OpenAccess true
Contact info(at)
Language Slovenian; Slovene
Resource Type lexicalConceptualResource
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics