Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.
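As a rough illustration of how the measures in each list relate to one another, the following minimal sketch computes them from raw tokens. It is not the LIST tool's implementation. The function name `frequency_table`, the text-type labels, and the sample tokens are hypothetical, and reporting relative frequency per million tokens is an assumption, since the description does not state the normalization base.

```python
from collections import Counter

def frequency_table(tokens_by_text_type):
    """Return one row per word with absolute frequency, relative frequency,
    percentage, and per-text-type counts (hypothetical layout)."""
    # Count tokens separately for every text type.
    per_type = {t: Counter(toks) for t, toks in tokens_by_text_type.items()}

    # Sum the per-type counts to get corpus-wide absolute frequencies.
    total = Counter()
    for counts in per_type.values():
        total.update(counts)
    corpus_size = sum(total.values())

    rows = []
    for word, absolute in total.most_common():
        rows.append({
            "word": word,
            "absolute": absolute,
            # Assumption: relative frequency is expressed per million tokens.
            "relative_per_million": absolute / corpus_size * 1_000_000,
            "percentage": absolute / corpus_size * 100,
            "distribution": {t: per_type[t][word] for t in per_type},
        })
    return rows

# Hypothetical toy corpus with two made-up text types.
sample = {
    "public": ["ja", "in", "ja", "to"],
    "private": ["ja", "to"],
}
rows = frequency_table(sample)
for row in rows:
    print(row)
```

The rows are sorted by descending absolute frequency, and the text-type distribution is kept as raw counts so that any other normalization can be derived from it later.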
The lists were extracted for each part-of-speech category. For each part of speech, two lists were extracted:
1) one containing lemmas and their text-type distribution;
2) one containing lower-case word forms as well as their standardized forms, lemmas, and morphosyntactic tags along with their text-type distribution.
In addition, four lists covering all words (regardless of their part-of-speech category) were extracted:
1) a list of all lemmas along with their part-of-speech category and text-type distribution;
2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution;
3) a list of all lower-case word forms with their standardized word forms, lemmas, part-of-speech categories, and text-type distribution;
4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).
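One way to picture the splitting of tags into columns is sketched below. It assumes positional tags in which every character encodes one attribute, as in MULTEXT-East-style tagsets. The function name `split_tag`, the fixed width of 8, the `-` filler, and the example tag are illustrative choices, not details taken from the lists themselves.

```python
def split_tag(tag, width=8, filler="-"):
    """Split a positional morphosyntactic tag into one column per character.

    Assumption: each character position encodes one attribute. Shorter tags
    are padded with a filler so every row has the same number of columns.
    """
    if len(tag) > width:
        raise ValueError(f"tag {tag!r} is longer than {width} positions")
    return list(tag.ljust(width, filler))

# Illustrative tag: noun, common, masculine, singular, nominative.
columns = split_tag("Ncmsn")
print(columns)
```

Padding to a fixed width keeps tags of different lengths aligned in the same columns, which makes it easier to filter a list by a single attribute.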
Compared to the previous version (http://hdl.handle.net/11356/1269), this version fixes several typos and replaces all instances of "normalized forms" with the more appropriate term "standardized forms" (as used in the SSJ project).