News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian and Estonian SADEmma 1.0

Dataset

We provide news datasets annotated on a three-point sentiment scale (positive, neutral, and negative) for Serbian, Bosnian, Macedonian, Albanian, and Estonian. For all languages except Estonian, each record pairs a source URL, where the corresponding text can be found, with a sentiment label.

For Estonian, we randomly sampled 100 articles from the "Ekspress news article archive (in Estonian and Russian) 1.0" (http://hdl.handle.net/11356/1408).

The data is organized in Tab-Separated Values (TSV) format. For Serbian, Bosnian, Macedonian, and Albanian, the dataset contains two columns: sourceURL and sentiment. For Estonian, the dataset consists of three columns: text ID (as used in the CLARIN.SI archive referenced above), body text, and sentiment label.
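As an illustration of the layout described above, the following is a minimal Python sketch for reading one of the TSV files and counting its sentiment labels. It makes several assumptions not stated in this record: that each file begins with a header row, that the Estonian columns can be called text_id, body and sentiment (hypothetical names), and that a file is opened by the caller under whatever name it has in the download. The way labels are encoded in the files is not assumed; the sketch simply counts whatever values appear.

```python
import csv
import io
from collections import Counter

# Column layouts taken from the description; the Estonian column
# names are hypothetical placeholders, not confirmed by the record.
URL_COLUMNS = ["sourceURL", "sentiment"]
ESTONIAN_COLUMNS = ["text_id", "body", "sentiment"]


def read_tsv(handle, columns, has_header=True):
    """Read tab-separated rows from an open text handle into dicts.

    has_header=True is an assumption; pass False if the file turns
    out to have no header row.
    """
    # QUOTE_NONE keeps quotation marks inside article text literal.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for index, fields in enumerate(reader):
        if has_header and index == 0:
            continue  # skip the assumed header row
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(columns):
            raise ValueError(
                f"row {index + 1}: expected {len(columns)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(columns, fields)))
    return rows


def label_counts(rows):
    """Count how often each sentiment value occurs."""
    return Counter(row["sentiment"] for row in rows)


# Tiny in-memory example with made-up URLs, standing in for a real file.
sample = (
    "sourceURL\tsentiment\n"
    "http://example.org/a\tpositive\n"
    "http://example.org/b\tnegative\n"
    "http://example.org/c\tpositive\n"
)
sample_rows = read_tsv(io.StringIO(sample), URL_COLUMNS)
sample_counts = label_counts(sample_rows)
```

For a real file, the same function could be called on a handle opened with open(path, encoding="utf-8", newline=""), using ESTONIAN_COLUMNS for the Estonian data.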

Identifier
PID	http://hdl.handle.net/11356/1987
Related Identifier	https://emma.ijs.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1987

Provenance
Creator	Ivačič, Nikola; Pelicon, Andraž; Koloski, Boshko; Pollak, Senja; Purver, Matthew
Publisher	Jožef Stefan Institute
Publication Year	2024
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Bosnian; Serbian; Macedonian; Albanian; Estonian
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 5
Discipline	Linguistics