Frequency lists of word parts from the Gigafida 2.0 corpus


Frequency lists of words split into word parts were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all lemmas or lower-case word forms occurring in the corpus, each split into its initial or final part (i.e. the first or last 1, 2, 3, 4 or 5 characters of the word) and the rest of the word. The lists also give absolute and relative frequencies, percentages, and the distribution across the text types included in the corpus taxonomy.
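The split described above can be illustrated with a minimal sketch. It is not the LIST tool's implementation: the function names `split_word` and `part_frequencies` are hypothetical, the sample words are toy input rather than corpus data, and skipping words shorter than the requested part length is an assumption.

```python
from collections import Counter


def split_word(word, n, position="initial"):
    """Return (part, rest) for the initial or final n characters of word."""
    if n < 1 or len(word) < n:
        return None  # assumption: words shorter than n are skipped
    if position == "initial":
        return word[:n], word[n:]
    if position == "final":
        return word[-n:], word[:-n]
    raise ValueError("position must be 'initial' or 'final'")


def part_frequencies(words, n, position="initial"):
    """Count how often each word part occurs in an iterable of words."""
    counts = Counter()
    for word in words:
        # The lists use lower-case word forms, so lower-case before splitting.
        split = split_word(word.lower(), n, position)
        if split is not None:
            counts[split[0]] += 1
    return counts


# Toy input for illustration only, not taken from the corpus.
sample = ["beseda", "besedilo", "Miza"]
print(split_word("beseda", 3, "initial"))  # ('bes', 'eda')
print(split_word("beseda", 2, "final"))    # ('da', 'bese')
print(part_frequencies(sample, 3, "initial"))
```

Relative frequencies could then be derived by dividing each count by the total number of counted words.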

The lists were extracted for each part-of-speech category. For each part of speech, a total of 20 lists were extracted: 1) 10 lists of initial or final word parts extracted from lemmas (part lengths of 1 to 5 characters, in initial and in final position), 2) 10 lists of initial or final word parts extracted from lower-case word forms.

In addition, 20 lists were extracted from all words, regardless of their part-of-speech category. For easier processing in statistical analysis software, shortened versions containing only the first 150,000 lines were created for the longer lists.
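A shortened version of this kind can be reproduced from a full list along the following lines. This is only a sketch: the helper names `first_lines` and `shorten_file` are hypothetical, and whether any header line counts towards the 150,000 lines in the published files is not stated here.

```python
from itertools import islice


def first_lines(lines, limit=150_000):
    """Return an iterator over at most `limit` lines from an iterable."""
    return islice(lines, limit)


def shorten_file(src_path, dst_path, limit=150_000):
    """Copy the first `limit` lines of a UTF-8 text file to a new file."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(first_lines(src, limit))
```

Because the file is read lazily, only the retained lines are processed, which keeps the approach practical for very long lists.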

Identifier
PID http://hdl.handle.net/11356/1275
Related Identifier http://slovnica.ijs.si/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1275
Provenance
Creator Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon
Publisher Centre for Language Resources and Technologies, University of Ljubljana; Jožef Stefan Institute
Publication Year 2019
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type lexicalConceptualResource
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 30
Discipline Linguistics