Frequency lists of words split into word parts were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all lemmas or lower-case word forms occurring in the corpus, split into their initial or final part (i.e. the initial or final string of 1, 2, 3, 4 or 5 characters in the word) and the rest of the word. In addition, the lists also contain absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.
The lists were extracted for each part-of-speech category. For each part-of-speech, a total of 20 lists were extracted:
1) 10 lists for initial or final word parts extracted from lemmas,
2) 10 lists for initial or final word parts extracted from lower-case word forms.
In addition, 20 lists were extracted from all words (regardless of their part-of-speech category). For easier processing in statistical analysis software, shortened versions of longer lists were made containing the first 150,000 lines.