Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.
Lists were extracted separately for each part-of-speech category. For each category, two lists were produced:
1) one containing lemmas and their text-type distribution,
2) one containing lower-case word forms as well as their normalized forms, lemmas, and morphosyntactic tags along with their text-type distribution.
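As a rough illustration of what a lemma list of type 1) holds, the sketch below computes absolute frequency, relative frequency, percentage, and text-type distribution from a sequence of (lemma, text-type) pairs. It is not the LIST tool's implementation. The function name, the per-million normalization of relative frequency, the column names, and the sample text-type labels are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def lemma_frequency_list(tokens):
    """Build a frequency list from (lemma, text_type) pairs.

    Hypothetical helper: the output layout is an assumption and does not
    reproduce the actual column layout of the published lists.
    """
    tokens = list(tokens)
    total = len(tokens)

    # Absolute frequency of each lemma over the whole input.
    absolute = Counter(lemma for lemma, _ in tokens)

    # Per-lemma counts broken down by text-type.
    by_type = defaultdict(Counter)
    for lemma, text_type in tokens:
        by_type[lemma][text_type] += 1

    rows = []
    for lemma, count in absolute.most_common():
        rows.append({
            "lemma": lemma,
            "absolute": count,
            # Assumed normalization: occurrences per million tokens.
            "per_million": count / total * 1_000_000,
            # Share of all tokens, in percent.
            "percentage": count / total * 100,
            "by_text_type": dict(by_type[lemma]),
        })
    return rows


# Invented sample data; the text-type labels are placeholders.
sample = [
    ("biti", "public"),
    ("biti", "private"),
    ("in", "public"),
    ("biti", "public"),
]
rows = lemma_frequency_list(sample)
```

Here `rows` is ordered by descending absolute frequency, so the first entry describes the most frequent lemma in the sample.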
In addition, four lists were extracted over all words, regardless of part-of-speech category:
1) a list of all lemmas along with their part-of-speech category and text-type distribution;
2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution;
3) a list of all lower-case word forms with their normalized word forms, lemmas, part-of-speech categories, and text-type distribution;
4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).
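The splitting of tags into several columns mentioned in 4) can be sketched as follows, assuming the tags are positional in the MULTEXT-East style, where the first character gives the category and each following character encodes one attribute value. How the LIST tool actually splits the tags is not stated here, so the function name, the fixed column width, and the sample tag are hypothetical.

```python
def split_tag(tag, width=6):
    """Split a positional tag into one column per character.

    Assumption: each character of the tag is one attribute value.
    Tags shorter than `width` are padded with empty strings so that
    every row has at least `width` columns; longer tags are not cut.
    """
    columns = list(tag)
    padding = max(0, width - len(columns))
    return columns + [""] * padding


# Illustrative MULTEXT-East-style tag, not taken from the lists themselves.
columns = split_tag("Ncmsn")
```

Padding to a fixed width keeps the table rectangular, which makes the resulting columns easier to load into spreadsheet or dataframe tools.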