Kres corpus n-grams 2.0

Dataset

A collection of n-grams extracted from the Kres corpus of written Slovene (cf. http://eng.slovenscina.eu/korpusi/kres). Three sets of n-gram lists are provided for lowercased word n-grams of length 1 to 5:
- extensive frequency lists of all extracted n-grams
- filtered frequency lists of n-grams with minimum frequency 10/mil.
- adjusted frequency list of all n-grams with minimum frequency 10/mil.

Only n-grams within sentences have been counted, ignoring punctuation. For the filtered and adjusted lists, only n-grams occurring in at least 2 different texts have been extracted.
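
The counting and filtering rules described above can be illustrated with a minimal Python sketch. It is not the procedure used to build the lists: the regular-expression tokenizer, the function names `count_ngrams` and `filter_ngrams`, and the reading of "10/mil." as occurrences per million word tokens are all assumptions made for illustration.

```python
from collections import Counter
import re

# Assumption: a simple regex tokenizer that keeps word characters and drops
# punctuation. The corpus itself comes with its own tokenization.
TOKEN_RE = re.compile(r"\w+")


def count_ngrams(texts, n):
    """Count lowercased word n-grams that lie within a single sentence.

    `texts` is an iterable of texts, each given as a list of sentence strings.
    Returns (frequency, text_frequency, total_tokens).
    """
    freq = Counter()       # total occurrences of each n-gram
    text_freq = Counter()  # number of different texts containing each n-gram
    total_tokens = 0
    for sentences in texts:
        seen_in_text = set()
        for sentence in sentences:
            tokens = TOKEN_RE.findall(sentence.lower())
            total_tokens += len(tokens)
            # The sliding window never crosses a sentence boundary.
            grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            freq.update(grams)
            seen_in_text.update(grams)
        text_freq.update(seen_in_text)
    return freq, text_freq, total_tokens


def filter_ngrams(freq, text_freq, total_tokens, min_per_million=10.0, min_texts=2):
    """Keep n-grams meeting both thresholds (assumed per-million normalization)."""
    return {
        gram: count
        for gram, count in freq.items()
        if count * 1_000_000 / total_tokens >= min_per_million
        and text_freq[gram] >= min_texts
    }


# Toy input: two texts, each a list of sentences.
texts = [["Danes je lep dan.", "Lep dan je."], ["Lep dan, kajne?"]]
freq, text_freq, total_tokens = count_ngrams(texts, 2)
filtered = filter_ngrams(freq, text_freq, total_tokens)
```

In the toy input, the bigram `("dan", "lep")` is never counted even though "dan." ends one sentence and "Lep" starts the next, and only `("lep", "dan")` survives the filter because it is the only bigram found in both texts.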

Key references:
- K. Dobrovoljc, 2018. N-gram frequency lists for reference corpora of Slovenian language. Proceedings of the Language Technologies & Digital Humanities Conference 2018.
- N. Logar Berginc, M. Grčar, M. Brakus, T. Erjavec, Š. Arhar Holdt and S. Krek, 2012. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede.
- M. B. O’Donnell, 2010. The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal 35, 135–169.

Identifier
PID	http://hdl.handle.net/11356/1193
Related Identifier	http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Dobrovoljc-K_Frekvencni-seznami-n-gramov-v-korpusih-slovenskega-jezika.pdf
Related Identifier	http://hdl.handle.net/11356/1045
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1193

Provenance
Creator	Dobrovoljc, Kaja
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2018
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics