Frequency lists of word-level n-grams from the GOS 1.0 corpus

Dataset

Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus, along with their absolute and relative frequencies, percentages, distribution across the text types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL.
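
For orientation, the collocation measures named above can be sketched for the 2-gram case using their common textbook definitions. This is an illustrative sketch only: the function name and its parameters are hypothetical, and the exact formulas used by the LIST tool (in particular for 3-, 4- and 5-grams) are not stated in this record and may differ.

```python
import math


def collocation_measures(f_xy, f_x, f_y, n):
    """Common textbook association measures for a 2-gram (x, y).

    f_xy: frequency of the 2-gram, f_x / f_y: frequencies of its two words,
    n: corpus size in tokens. All names are illustrative assumptions and
    are not taken from the dataset or from the LIST tool.
    """
    # Co-occurrence frequency expected if x and y were independent.
    expected = f_x * f_y / n
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "dice": dice,
        "logdice": 14 + math.log2(dice),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        # Simple log-likelihood: compares observed and expected frequency only.
        "simple_ll": 2 * (f_xy * math.log(f_xy / expected) - (f_xy - expected)),
    }


if __name__ == "__main__":
    for name, value in collocation_measures(8, 16, 16, 1024).items():
        print(f"{name}\t{value:.3f}")
```

All six values depend only on the observed 2-gram frequency, the two word frequencies, and the corpus size, so they can in principle be recomputed from a frequency list if those counts are available.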

The n-grams were extracted from lower-case word forms, normalized word forms, and morphosyntactic tags.
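
As a rough illustration of what counting word-level n-grams over lower-case word forms involves, a minimal sketch follows. The function name and the sample tokens are hypothetical; the actual LIST extraction is likely to handle details not modelled here (for example utterance boundaries, normalized forms, and morphosyntactic tags).

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count word-level n-grams over lower-cased tokens (illustrative only)."""
    lowered = [t.lower() for t in tokens]
    # zip over n shifted copies of the token list yields each n-token window.
    return Counter(zip(*(lowered[i:] for i in range(n))))


if __name__ == "__main__":
    sample = ["To", "je", "to", "je"]  # made-up token sequence
    for ngram, count in ngram_counts(sample, 2).most_common():
        print(" ".join(ngram), count, sep="\t")
```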

For large lists, shortened versions with the first 150,000 lines were also prepared to facilitate further processing in spreadsheet analysis software.
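
A shortened file of this kind can be reproduced from a full list with a few lines of code. The sketch below is an assumption about the procedure, not the authors' script: the function name and file paths are hypothetical, and whether a header row counts toward the 150,000 lines is not stated in this record.

```python
from itertools import islice


def write_head(src_path, dst_path, max_lines=150_000):
    """Copy at most max_lines lines from src_path to dst_path.

    Returns the number of lines written. Paths and the default limit are
    illustrative; any header row is counted like every other line.
    """
    written = 0
    # newline="" keeps the original line endings unchanged.
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in islice(src, max_lines):
            dst.write(line)
            written += 1
    return written
```

Reading lazily with islice avoids loading a large list into memory before truncating it.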

Identifier
PID	http://hdl.handle.net/11356/1271
Related Identifier	http://hdl.handle.net/11356/1365
Related Identifier	http://slovnica.ijs.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1271

Provenance
Creator	Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon
Publisher	Centre for Language Resources and Technologies, University of Ljubljana; Jožef Stefan Institute
Publication Year	2019
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics