Keywords and n-grams from a textbook corpus

Dataset

Wordlists, keywords and n-grams were extracted from a corpus of textbooks for Slovenian elementary and secondary schools. The corpus contains 4,302,857 words (5,373,268 tokens) and consists of 127 textbooks from 16 different subjects:
- Biology (6 textbooks; 293,935 words)
- State, society and ethics (1 textbook; 21,881 words)
- Society (4 textbooks; 64,126 words)
- Physics (5 textbooks; 185,171 words)
- Geography (7 textbooks; 202,101 words)
- Music (8 textbooks; 224,034 words)
- Home Economics (3 textbooks; 33,803 words)
- Chemistry (7 textbooks; 282,543 words)
- Art (3 textbooks; 146,681 words)
- Mathematics (23 textbooks; 764,012 words)
- Science (5 textbooks; 226,191 words)
- Science and technology (6 textbooks; 183,749 words)
- Slovene language (37 textbooks; 1,437,945 words)
- Environmental Education (7 textbooks; 38,645 words)
- Technology (1 textbook; 24,733 words)
- History (4 textbooks; 173,307 words)
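
The per-subject figures can be cross-checked against the stated totals. The following is a minimal sketch of that check, not part of the dataset: the names `SUBJECTS` and `corpus_totals` are illustrative, and the Home Economics count is taken to be 33,803 words, the only reading under which the subject figures add up to the stated corpus size.

```python
# Per-subject figures from the description: (textbooks, words).
# Assumption: Home Economics is read as 33,803 words.
SUBJECTS = {
    "Biology": (6, 293_935),
    "State, society and ethics": (1, 21_881),
    "Society": (4, 64_126),
    "Physics": (5, 185_171),
    "Geography": (7, 202_101),
    "Music": (8, 224_034),
    "Home Economics": (3, 33_803),
    "Chemistry": (7, 282_543),
    "Art": (3, 146_681),
    "Mathematics": (23, 764_012),
    "Science": (5, 226_191),
    "Science and technology": (6, 183_749),
    "Slovene language": (37, 1_437_945),
    "Environmental Education": (7, 38_645),
    "Technology": (1, 24_733),
    "History": (4, 173_307),
}


def corpus_totals(subjects):
    """Return (number of subjects, total textbooks, total words)."""
    textbooks = sum(t for t, _ in subjects.values())
    words = sum(w for _, w in subjects.values())
    return len(subjects), textbooks, words


print(corpus_totals(SUBJECTS))  # (16, 127, 4302857)
```

The result matches the 16 subjects, 127 textbooks and 4,302,857 words given in the description.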

The lists were manually cleaned: most items not found in the reference morphological lexicon Sloleks (http://hdl.handle.net/11356/1039) were removed. The removed items were mainly conversion errors.

The lists include only those words, keywords or n-grams that were found in at least 8 different subjects. Keyword lists were extracted using the Sketch Engine tool; the minimum frequency was set to 5, and the statistic used was average relative frequency. The minimum frequency for n-grams was 10.
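
The subject-coverage criterion can be illustrated with a short sketch. This is not the procedure used to build the lists: the function `items_in_enough_subjects`, its input layout (a mapping from subject to per-item frequencies) and the choice to apply the minimum frequency within each subject are all assumptions made for illustration, and the toy data below is invented.

```python
from collections import Counter


def items_in_enough_subjects(freq_by_subject, min_subjects=8, min_freq=5):
    """Keep items that occur in at least `min_subjects` subjects.

    Assumption: an item counts for a subject only if its frequency
    there is at least `min_freq`. The description does not say whether
    the threshold applies per subject or to the corpus as a whole.
    """
    coverage = Counter()
    for item_freqs in freq_by_subject.values():
        for item, freq in item_freqs.items():
            if freq >= min_freq:
                coverage[item] += 1
    return sorted(item for item, n in coverage.items() if n >= min_subjects)


# Invented toy data with three subjects, so a lower threshold is used.
toy = {
    "Biology": {"primer": 12, "celica": 40},
    "Physics": {"primer": 9, "sila": 30},
    "History": {"primer": 6, "celica": 2},
}
print(items_in_enough_subjects(toy, min_subjects=2))  # ['primer']
```

With the default arguments the function mirrors the thresholds stated for keywords (8 subjects, minimum frequency 5); for n-grams `min_freq=10` would be the corresponding setting.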

Identifier
PID	http://hdl.handle.net/11356/1215
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1215

Provenance
Creator	Kosem, Iztok; Pori, Eva; Arhar Holdt, Špela
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2019
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics