Multilingual IPTC Media Topic dataset EMMediaTopic 1.0

The multilingual IPTC Media Topic dataset EMMediaTopic 1.0 is a collection of news articles in Catalan, Croatian, Greek, and Slovenian, automatically annotated with the 17 top-level topic labels from the hierarchical IPTC NewsCodes Media Topic schema. The texts were annotated by the GPT-4o large language model, accessed via the OpenAI API (https://openai.com/index/hello-gpt-4o/). Evaluation against a manually annotated test set showed that the model performs well as an annotator, with an average macro-F1 score of 0.731 and a micro-F1 score of 0.722. In addition, inter-annotator agreement measured on the test set showed that the reliability of GPT-4o as a data annotator is comparable to that of human annotators.
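To make the two reported metrics concrete, the following minimal sketch computes macro-F1 and micro-F1 for single-label classification using only the standard library. The function name `f1_scores` and the toy gold and predicted labels are illustrative assumptions; they are not taken from the dataset or from the authors' evaluation code.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Macro-F1: unweighted mean of the per-label F1 scores.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro-F1: F1 over the pooled counts of all labels.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data for illustration only (not from EMMediaTopic).
gold = ["sport", "sport", "politics", "health", "health", "health"]
pred = ["sport", "politics", "politics", "health", "health", "sport"]
macro_f1, micro_f1 = f1_scores(gold, pred)
print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```

Macro-F1 weights every label equally regardless of its frequency, whereas micro-F1 pools all decisions; when each text receives exactly one label, micro-F1 equals accuracy.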

The EMMediaTopic dataset consists of 21,000 texts, divided into a training set (20,000 instances) and a development set (1,000 instances) with identical label distributions. The texts are news articles from the Catalan (ca), Croatian (hr), Greek (el), and Slovenian (sl) MaCoCu-Genre corpora (http://hdl.handle.net/11356/1969). For each language, a random sample of 5,250 texts classified under the "News" genre was extracted from the web corpus. Because of the input-length limitations of the XLM-RoBERTa model fine-tuned on this dataset, the texts were truncated to the first 512 words.
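The truncation step and the sample sizes described above can be sketched as follows. This is a minimal illustration that assumes "words" means whitespace-separated tokens; the record does not state how the authors tokenized the texts, and the helper name `truncate_to_words` is hypothetical.

```python
def truncate_to_words(text: str, max_words: int = 512) -> str:
    """Keep only the first `max_words` whitespace-separated words.

    Assumption: words are split on whitespace; runs of whitespace
    are collapsed to single spaces in the output.
    """
    words = text.split()
    return " ".join(words[:max_words])


# Sample sizes as stated in the description.
LANGS = ["ca", "hr", "el", "sl"]
SAMPLE_PER_LANG = 5_250
total = len(LANGS) * SAMPLE_PER_LANG
train_size, dev_size = 20_000, 1_000
assert total == train_size + dev_size  # 4 x 5,250 = 21,000

# Synthetic 600-word text to demonstrate the truncation.
long_text = " ".join(f"w{i}" for i in range(600))
short = truncate_to_words(long_text)
print(len(short.split()))
```

Texts shorter than the limit pass through with their words unchanged.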

The dataset employs the following 17 top-level IPTC NewsCodes Media Topic (https://cv.iptc.org/newscodes/mediatopic) labels: 'arts, culture, entertainment and media', 'conflict, war and peace', 'crime, law and justice', 'disaster, accident and emergency incident', 'economy, business and finance', 'education', 'environment', 'health', 'human interest', 'labour', 'lifestyle and leisure', 'politics', 'religion', 'science and technology', 'society', 'sport', and 'weather'.
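For downstream use, the 17 labels listed above can be kept in a single constant with label-to-index mappings. The ordering below simply follows the list in this description; it is an assumption for illustration and is not necessarily the index order used by any published classifier.

```python
# The 17 top-level labels, in the order given in the description.
IPTC_TOP_LEVEL_LABELS = [
    "arts, culture, entertainment and media",
    "conflict, war and peace",
    "crime, law and justice",
    "disaster, accident and emergency incident",
    "economy, business and finance",
    "education",
    "environment",
    "health",
    "human interest",
    "labour",
    "lifestyle and leisure",
    "politics",
    "religion",
    "science and technology",
    "society",
    "sport",
    "weather",
]

# Illustrative mappings; the index order is an assumption.
label2id = {label: i for i, label in enumerate(IPTC_TOP_LEVEL_LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(IPTC_TOP_LEVEL_LABELS))
```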

The EMMediaTopic dataset is provided in JSONL format, where each text is accompanied by the following metadata: document_id (document ID from the MaCoCu-Genre corpus), lang (language code: ca, el, hr, or sl), GPT-IPTC-label (GPT-assigned IPTC topic label), and split (train or dev).
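A minimal sketch of reading such a JSONL file with the Python standard library is shown below. The metadata keys (document_id, lang, GPT-IPTC-label, split) are those named above; the key holding the article text is not named in this description, so `text` is an assumption, and the sample records are invented placeholders rather than real entries.

```python
import io
import json
from collections import Counter


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSONL file handle."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Invented placeholder records; the "text" key is an assumed field name.
sample = io.StringIO(
    '{"document_id": "doc-1", "lang": "sl", "GPT-IPTC-label": "sport", '
    '"split": "train", "text": "placeholder"}\n'
    '{"document_id": "doc-2", "lang": "hr", "GPT-IPTC-label": "politics", '
    '"split": "train", "text": "placeholder"}\n'
    '{"document_id": "doc-3", "lang": "el", "GPT-IPTC-label": "sport", '
    '"split": "dev", "text": "placeholder"}\n'
)

records = list(read_jsonl(sample))

# Basic sanity checks: instances per split and per language.
per_split = Counter(r["split"] for r in records)
per_lang = Counter(r["lang"] for r in records)
print(per_split, per_lang)
```

To read the downloaded file instead, the same function can be passed a handle opened with `open(path, encoding="utf-8")`, where `path` points to the local copy of the dataset.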

This dataset was used for the development of the Multilingual IPTC news topic classifier (https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier), a fine-tuned Transformer-based XLM-RoBERTa model that can be applied to any of the languages included in the XLM-RoBERTa pretraining dataset.

Identifier
PID	http://hdl.handle.net/11356/1991
Related Identifier	https://doi.org/10.48550/arXiv.2411.19638
Related Identifier	https://emma.ijs.si/en/project-plans/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1991

Provenance
Creator	Kuzman, Taja; Ljubešić, Nikola
Publisher	Jožef Stefan Institute
Publication Year	2024
Funding Reference	info:eu-repo/grantAgreement/EC/HE/101129751
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Croatian; Slovenian; Slovene; Greek, Modern (1453-); Greek; Catalan; Valencian
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 1
Discipline	Linguistics