The X-GENRE dataset comprises almost 3,000 web texts in English and Slovenian, manually annotated with genre labels. The dataset supports automated genre identification, genre analysis, and other research on web corpora. Among other uses, it served as the basis for developing the multilingual X-GENRE classifier (http://hdl.handle.net/11356/1961).
The X-GENRE dataset was constructed by merging three manually annotated datasets and mapping their original schemata to a joint genre schema (the "X-GENRE schema"): 1) the Slovenian GINCO dataset (http://hdl.handle.net/11356/1467), 2) the English CORE dataset (https://github.com/TurkuNLP/CORE-corpus), and 3) the English FTD dataset (https://github.com/ssharoff/genre-keras). All of the original genre datasets are based on web corpora. The X-GENRE schema consists of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the README provided with the files for details on the labels).
The dataset is separated into train, development and test splits. The train split consists of 1,772 texts and 1,940,317 words, the development split of 592 texts and 798,025 words, and the test split of 592 texts and 583,595 words. The splits are stratified by label. As the dataset combines two English datasets and one Slovenian dataset, the distribution of texts across the two languages is roughly two to one: 2,063 English texts and 893 Slovenian texts.
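The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the text counts stated in this description; the variable names are illustrative and not part of the dataset.

```python
# Text counts per split and per language, as stated in the description.
split_texts = {"train": 1772, "development": 592, "test": 592}
language_texts = {"English": 2063, "Slovenian": 893}

total = sum(split_texts.values())

# Both breakdowns should describe the same set of texts.
assert total == sum(language_texts.values())

# Share of the whole dataset held by each split.
split_shares = {name: count / total for name, count in split_texts.items()}

# English-to-Slovenian ratio behind the "roughly two to one" remark.
language_ratio = language_texts["English"] / language_texts["Slovenian"]

print(total)
for name, share in split_shares.items():
    print(f"{name}: {share:.1%}")
print(f"English:Slovenian = {language_ratio:.2f}:1")
```

Both breakdowns add up to 2,956 texts, the splits correspond to approximately 60/20/20 percent of the texts, and the language ratio is closer to 2.3 to one.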
The dataset is in JSONL format. It has the following attributes: text (text instance), labels (genre label), dataset (original manually-annotated genre dataset from which the instance was obtained – CORE, GINCO or FTD), and language (language of the text – Slovenian or English).
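As an illustration of the format, the following minimal sketch reads one JSONL file with the Python standard library and counts instances per attribute value. The helper names load_jsonl and count_by are hypothetical, and the sketch assumes that every line is a single JSON object carrying the four attributes listed above; the actual file names and the exact spelling of attribute values should be taken from the README.

```python
import json
from collections import Counter
from pathlib import Path

# The four attributes named in the dataset description.
EXPECTED_KEYS = {"text", "labels", "dataset", "language"}


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, one per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)
            # Assumption: each record carries all four described attributes.
            missing = EXPECTED_KEYS - set(record)
            if missing:
                raise ValueError(f"line {line_number}: missing {sorted(missing)}")
            records.append(record)
    return records


def count_by(records, key):
    """Count how many records carry each value of the given attribute."""
    return Counter(record[key] for record in records)
```

For example, count_by(load_jsonl(path), "labels") gives the label distribution of one split, and passing "dataset" or "language" instead gives the breakdown by source dataset or by language.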
This work received funding from the European Union’s Connecting Europe Facility 2014–2020 – CEF Telecom – under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the authors’ views. The Agency is not responsible for any use that may be made of the information it contains.