Wordlists, keywords and n-grams were extracted from a corpus of textbooks for Slovenian elementary and secondary schools. The corpus contains 4,302,857 words (5,373,268 tokens), and consists of 127 textbooks from 16 different subjects:
- Biology (6 textbooks; 293,935 words),
- State, society and ethics (1 textbook; 21,881 words),
- Society (4 textbooks; 64,126 words),
- Physics (5 textbooks; 185,171 words),
- Geography (7 textbooks; 202,101 words),
- Music (8 textbooks; 224,034 words),
- Home Economics (3 textbooks; 33,803 words),
- Chemistry (7 textbooks; 282,543 words),
- Art (3 textbooks; 146,681 words),
- Mathematics (23 textbooks; 764,012 words),
- Science (5 textbooks; 226,191 words),
- Science and technology (6 textbooks; 183,749 words),
- Slovene language (37 textbooks; 1,437,945 words),
- Environmental Education (7 textbooks; 38,645 words),
- Technology (1 textbook; 24,733 words),
- History (4 textbooks; 173,307 words).
The lists were manually cleaned: most items not found in the reference morphological lexicon Sloleks (http://hdl.handle.net/11356/1039) were removed. The removed items were mainly conversion errors.
The lists include only those words, keywords or n-grams that were found in at least 8 different subjects. Keyword lists were extracted using the Sketch Engine tool; the minimum frequency was set to 5, and the statistic used was average relative frequency. The minimum frequency for n-grams was 10.
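The subject-coverage filter described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build the lists: the function name `filter_by_dispersion`, the input layout (one frequency counter per subject) and the choice to apply the minimum frequency to the summed corpus frequency are all assumptions made for illustration. Only the thresholds (8 subjects, n-gram frequency 10) come from the description.

```python
from collections import Counter, defaultdict

MIN_SUBJECTS = 8      # from the description: found in at least 8 different subjects
MIN_NGRAM_FREQ = 10   # from the description: minimum frequency for n-grams


def filter_by_dispersion(counts_by_subject, min_subjects=MIN_SUBJECTS, min_freq=1):
    """Keep items attested in enough subjects and frequent enough overall.

    counts_by_subject: dict mapping a subject name to a Counter of item -> frequency
    (hypothetical input layout). min_freq is applied to the frequency summed over
    all subjects, which is an assumption, not something the description states.
    """
    total = Counter()
    subjects_seen = defaultdict(set)
    for subject, counts in counts_by_subject.items():
        for item, freq in counts.items():
            if freq > 0:
                total[item] += freq
                subjects_seen[item].add(subject)
    return {
        item: total[item]
        for item in total
        if len(subjects_seen[item]) >= min_subjects and total[item] >= min_freq
    }


# Toy data: "in" occurs in 9 subjects, "redek" in only one.
toy = {f"subject_{i}": Counter({"in": 3}) for i in range(9)}
toy["subject_0"]["redek"] = 50
kept = filter_by_dispersion(toy, min_freq=MIN_NGRAM_FREQ)
print(kept)
```

In this toy run, "redek" is dropped despite its high frequency because it appears in a single subject, which is the effect the 8-subject threshold is meant to have.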