Frequency lists of words split into word parts were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). Each list contains all lemmas, lower-case word forms, or standardized word forms occurring in the corpus, split into an initial or final part (i.e., the first or last 1, 2, 3, 4 or 5 characters of the word) and the rest of the word. The lists also give absolute and relative frequencies, percentages, and the distribution across the text types included in the corpus taxonomy.
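The split described above can be illustrated with a minimal Python sketch. It is not the LIST tool's code: the function names `split_word` and `part_frequencies` are hypothetical, the sample words are toy data rather than corpus content, and the handling of words shorter than the requested part length (the whole word becomes the part, the rest is empty) is an assumption.

```python
from collections import Counter
from typing import Iterable, Tuple


def split_word(word: str, n: int, side: str = "initial") -> Tuple[str, str]:
    """Split a word into (part, rest), where part is its first or last n characters."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if side == "initial":
        return word[:n], word[n:]
    if side == "final":
        return word[-n:], word[:-n]
    raise ValueError("side must be 'initial' or 'final'")


def part_frequencies(words: Iterable[str], n: int, side: str = "initial") -> Counter:
    """Count how often each initial or final part occurs in a sequence of words."""
    return Counter(split_word(w, n, side)[0] for w in words)


print(split_word("govoriti", 3, "initial"))  # ('gov', 'oriti')
print(split_word("govoriti", 3, "final"))    # ('iti', 'govor')
```

Calling `part_frequencies` on a list of lemmas or word forms gives absolute counts per word part; relative frequencies and percentages would then follow from dividing by the total, although the exact normalization used in the published lists is not specified here.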
The lists were extracted separately for each part-of-speech category, with 30 lists per category:
1) 10 lists for initial or final word parts extracted from lemmas,
2) 10 lists for initial or final word parts extracted from lower-case word forms,
3) 10 lists for initial or final word parts extracted from standardized word forms.
In addition, 30 lists were extracted from all words (regardless of their part-of-speech category).
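The count of 30 lists per category is consistent with combining three word representations, two positions (initial or final) and five part lengths. The sketch below enumerates those combinations under that assumption; the labels in `SOURCES` and `SIDES` are illustrative and do not reflect the actual file names in the dataset.

```python
from itertools import product

# Assumed breakdown: 3 word representations x 2 positions x 5 part lengths.
SOURCES = ("lemma", "lower-case word form", "standardized word form")
SIDES = ("initial", "final")
LENGTHS = (1, 2, 3, 4, 5)

configs = list(product(SOURCES, SIDES, LENGTHS))

# 10 lists per word representation, 30 lists per part-of-speech category.
per_source = len(SIDES) * len(LENGTHS)
print(per_source, len(configs))  # 10 30
```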
Compared to the previous version (http://hdl.handle.net/11356/1270), this version fixes several typos and replaces all instances of "normalized forms" with the more appropriate term "standardized forms" (as used in the SSJ project).