A collection of n-grams extracted from the Kres corpus of written Slovene (cf. http://eng.slovenscina.eu/korpusi/kres). Three sets of n-gram lists are provided for lowercased word n-grams of length 1 to 5:
- extensive frequency lists of all extracted n-grams
- filtered frequency lists of n-grams with a minimum frequency of 10 per million
- adjusted frequency lists of all n-grams with a minimum frequency of 10 per million
Only n-grams within sentences have been counted, ignoring punctuation. For the filtered and adjusted lists, only n-grams occurring in at least 2 different texts have been included.
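
The counting and filtering rules above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the lists: the regex tokenizer, the function names (sentence_ngrams, count_ngrams, filter_ngrams), the input layout (a mapping from text id to a list of sentence strings) and the reading of the threshold as occurrences per million corpus tokens are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: a token is any run of word characters; punctuation is dropped.
TOKEN_RE = re.compile(r"\w+")


def sentence_ngrams(sentence, max_n=5):
    """Yield lowercased word n-grams (n = 1..max_n) from a single sentence."""
    tokens = TOKEN_RE.findall(sentence.lower())
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def count_ngrams(texts, max_n=5):
    """Count n-grams per sentence, so no n-gram spans a sentence boundary.

    texts: hypothetical layout, mapping text id -> list of sentence strings.
    Returns (frequency counter, mapping n-gram -> set of text ids).
    """
    freq = Counter()
    text_ids = defaultdict(set)
    for text_id, sentences in texts.items():
        for sentence in sentences:
            for gram in sentence_ngrams(sentence, max_n):
                freq[gram] += 1
                text_ids[gram].add(text_id)
    return freq, text_ids


def filter_ngrams(freq, text_ids, corpus_size, min_per_million=10.0, min_texts=2):
    """Keep n-grams meeting both the relative-frequency and dispersion thresholds.

    corpus_size: total number of tokens in the corpus (assumed normalisation base).
    """
    kept = {}
    for gram, count in freq.items():
        per_million = count * 1_000_000 / corpus_size
        if per_million >= min_per_million and len(text_ids[gram]) >= min_texts:
            kept[gram] = count
    return kept
```

In this sketch, an n-gram found in only one text is dropped by filter_ngrams regardless of how frequent it is, which mirrors the "at least 2 different texts" condition.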
Key references:
- K. Dobrovoljc, 2018. N-gram frequency lists for reference corpora of Slovenian language. Proceedings of the Language Technologies & Digital Humanities Conference 2018.
- N. Logar Berginc, M. Grčar, M. Brakus, T. Erjavec, Š. Arhar Holdt and S. Krek, 2012. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede.
- M. B. O’Donnell, 2010. The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal 35, 135–169.