Frequency lists of words from the GOS 1.0 corpus 1.1

Dataset

PID

Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.

The lists were extracted for each part-of-speech category. For each part-of-speech, two lists were extracted: 1) one containing lemmas and their text-type distribution, 2) one containing lower-case word forms as well as their standardized forms, lemmas, and morphosyntactic tags along with their text-type distribution.

In addition, four lists were extracted from all words (regardless of their part-of-speech category): 1) a list of all lemmas along with their part-of-speech category and text-type distribution; 2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution; 3) a list of all lower-case word forms with their standardized word forms, lemmas, part-of-speech categories, and text-type distribution; 4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).

Compared to the previous version (http://hdl.handle.net/11356/1269), this one includes fixes of several typos and substitutes all instances of "normalized forms" with the more adequate term "standardized forms" (as used in the SSJ project).

Identifier
PID	http://hdl.handle.net/11356/1364
Related Identifier	http://hdl.handle.net/11356/1269
Related Identifier	http://slovnica.ijs.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1364

Provenance
Creator	Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon
Publisher	Centre for Language Resources and Technologies, University of Ljubljana; Jožef Stefan Institute
Publication Year	2020
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics