Frequency lists of word parts from the GOS 1.0 corpus


Frequency lists of words split into word parts were extracted from the GOS 1.0 Corpus of Spoken Slovene using the LIST corpus extraction tool. The lists contain all lemmas, lower-case word forms or normalized word forms occurring in the corpus, each split into its initial or final part (i.e. the initial or final string of 1, 2, 3, 4 or 5 characters in the word) and the rest. The lists also contain absolute and relative frequencies, percentages, and the distribution across the text types included in the corpus taxonomy.
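The split described above can be illustrated with a minimal Python sketch. It is not the LIST tool's implementation: the function names `split_word` and `part_frequencies` are hypothetical, the toy word list is invented for illustration, and both skipping words shorter than the requested part length and computing the percentage as a share of the counted words are assumptions.

```python
from collections import Counter


def split_word(word, n, position="initial"):
    """Split a word into (part, rest), where part is its first or last n characters.

    Returns None for words shorter than n (an assumption; the source does not
    say how such words are handled).
    """
    if position not in ("initial", "final"):
        raise ValueError("position must be 'initial' or 'final'")
    if not 1 <= n <= 5:
        raise ValueError("n must be between 1 and 5")
    if len(word) < n:
        return None
    if position == "initial":
        return word[:n], word[n:]
    return word[-n:], word[:-n]


def part_frequencies(words, n, position="initial"):
    """Map each word part to (absolute count, percentage of counted words)."""
    counts = Counter()
    for word in words:
        split = split_word(word.lower(), n, position)
        if split is not None:
            counts[split[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {part: (count, 100.0 * count / total) for part, count in counts.items()}


# Toy input, not taken from the corpus.
toy_words = ["govoriti", "govor", "gora", "beseda"]
frequencies = part_frequencies(toy_words, 2, "initial")
```

Calling `split_word("beseda", 2, "final")` separates the final two characters from the rest of the word, and `frequencies` groups the toy words by their two-character initial part.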

The lists were extracted for each part-of-speech category. For each part of speech, a total of 30 lists were extracted: 1) 10 lists for initial or final word parts extracted from lemmas, 2) 10 lists for initial or final word parts extracted from lower-case word forms, 3) 10 lists for initial or final word parts extracted from normalized word forms. Each set of 10 covers the two positions (initial, final) at each of the five part lengths (1 to 5 characters).
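The count of 30 lists per part of speech follows from combining three word representations, two positions and five part lengths. The sketch below only enumerates those combinations; the labels and the name `list_configurations` are illustrative assumptions and do not reflect the file naming used in the dataset.

```python
from itertools import product

# Illustrative labels (assumptions), based on the description above.
WORD_TYPES = ("lemma", "lower-case word form", "normalized word form")
POSITIONS = ("initial", "final")
LENGTHS = (1, 2, 3, 4, 5)


def list_configurations():
    """Return every (word type, position, part length) combination."""
    return list(product(WORD_TYPES, POSITIONS, LENGTHS))


configs = list_configurations()
```

Here `configs` holds one entry per list: 3 word types x 2 positions x 5 lengths.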

In addition, 30 lists were extracted from all words (regardless of their part-of-speech category).

Related Identifier
Metadata Access
Creator Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon
Publisher Centre for Language Resources and Technologies, University of Ljubljana; Jožef Stefan Institute
Publication Year 2019
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB
OpenAccess true
Contact info(at)
Language Slovenian; Slovene
Resource Type lexicalConceptualResource
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics