Frequency lists of word parts from the Gigafida 2.0 corpus

Dataset

Frequency lists of words split into word parts were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all lemmas or lower-case word forms occurring in the corpus, each split into an initial or final part (the first or last 1, 2, 3, 4 or 5 characters of the word) and the rest of the word. The lists also give absolute and relative frequencies, percentages, and the distribution across the text types included in the corpus taxonomy.
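
The splitting described above can be illustrated with a minimal Python sketch. It is not the LIST tool's implementation: the function names `split_word` and `part_frequencies` are hypothetical, and skipping words shorter than the requested part length is an assumption, since the description does not say how such words were handled.

```python
from collections import Counter


def split_word(word, n, position="initial"):
    """Split `word` into an n-character initial or final part and the rest.

    Returns a (part, rest) tuple, or None if the word has fewer than n
    characters (an assumption; the source does not describe this case).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if position not in ("initial", "final"):
        raise ValueError("position must be 'initial' or 'final'")
    if len(word) < n:
        return None
    if position == "initial":
        return word[:n], word[n:]
    return word[-n:], word[:-n]


def part_frequencies(words, n, position="initial"):
    """Map each word part to (absolute frequency, relative frequency)."""
    counts = Counter()
    for word in words:
        split = split_word(word.lower(), n, position)
        if split is not None:
            counts[split[0]] += 1
    total = sum(counts.values())
    return {part: (count, count / total) for part, count in counts.items()}
```

For example, `split_word("hiša", 2)` yields the initial part `hi` and the rest `ša`, while `split_word("hiša", 2, "final")` yields the final part `ša` and the rest `hi`.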

The lists were extracted separately for each part-of-speech category. For each part of speech, a total of 20 lists were extracted: 1) 10 lists of initial or final word parts extracted from lemmas, 2) 10 lists of initial or final word parts extracted from lower-case word forms.
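
The count of 20 lists per part of speech can be reproduced with a short Python sketch. It assumes that the 10 lists per unit correspond to the two positions (initial, final) combined with the five part lengths (1 to 5 characters); the labels used below are illustrative and are not the file names used in the dataset.

```python
from itertools import product

# Illustrative labels only; the dataset's actual file naming is not given here.
UNITS = ("lemma", "lower-case word form")
POSITIONS = ("initial", "final")
LENGTHS = (1, 2, 3, 4, 5)

# One configuration per extracted list: unit x position x part length.
configurations = list(product(UNITS, POSITIONS, LENGTHS))

# Number of lists per unit (lemmas vs. lower-case word forms).
per_unit = {
    unit: sum(1 for u, _, _ in configurations if u == unit) for unit in UNITS
}
```

Under this assumption, `len(configurations)` is 20 and each unit accounts for 10 lists, matching the totals given above.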

In addition, 20 lists were extracted from all words, regardless of their part-of-speech category. For easier processing in statistical analysis software, shortened versions of the longer lists were also made, containing only the first 150,000 lines.
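
A shortened version of this kind can be sketched in Python as follows. This is only an illustration of keeping the first 150,000 lines of a UTF-8 text file: the function name `shorten_list` and its parameters are hypothetical, and whether a header row counts towards the 150,000 lines in the published files is not stated in the description.

```python
from itertools import islice


def shorten_list(source_path, target_path, max_lines=150_000):
    """Copy at most `max_lines` lines from source_path to target_path.

    Returns the number of lines written. Any header row is treated like
    an ordinary line (an assumption, see the note above).
    """
    written = 0
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as dst:
        # islice stops after max_lines without reading the rest of the file.
        for line in islice(src, max_lines):
            dst.write(line)
            written += 1
    return written
```

Reading lazily with `islice` avoids loading a full list into memory, which matters for the longer lists that motivated the shortened versions.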

Identifier
PID	http://hdl.handle.net/11356/1275
Related Identifier	http://slovnica.ijs.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1275

Provenance
Creator	Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon
Publisher	Centre for Language Resources and Technologies, University of Ljubljana; Jožef Stefan Institute
Publication Year	2019
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 30
Discipline	Linguistics