Slovenian keyword extraction dataset from SentiNews 1.0

Dataset

The dataset consists of 7514 Slovenian news articles from the SentiNews 1.0 corpus (Bučar et al. 2017, http://hdl.handle.net/11356/1110) for which article keywords were available. We provide train and test splits (5995 articles for training and 1519 for testing) that can be used for keyword extraction experiments. The data is in JSON format with the following fields: title, keywords, lang (always Slovene) and body (the content of the article).

In our paper we addressed keyword extraction in a cross-lingual setting: Koloski, Boshko, et al. "Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?" arXiv preprint arXiv:2202.06650 (2022). [https://arxiv.org/pdf/2202.06650.pdf]

To reproduce the results, you can use the keyword datasets available at http://hdl.handle.net/11356/1403, described in: Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.
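
As a rough illustration of how the splits might be read, the following Python sketch loads articles carrying the fields listed above (title, keywords, lang, body). The record does not say whether each file holds a single JSON array or one JSON object per line, nor how the keywords field is encoded, so the loader accepts both layouts and leaves keywords untouched. The function name load_articles is hypothetical and not part of the dataset.

```python
import gzip
import json
from pathlib import Path


def load_articles(path):
    """Load articles from a JSON or JSON Lines file, optionally gzip-compressed.

    Assumption: the exact file layout is not documented in this record,
    so both a single JSON document and one-object-per-line are accepted.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        text = handle.read().strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single JSON document: either a list of articles or one article.
    return data if isinstance(data, list) else [data]
```

With the archive unpacked, a call such as load_articles("train.json") (the file name is an assumption) would be expected to return a list of dictionaries, each exposing the title, keywords, lang and body fields.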

Identifier
PID	http://hdl.handle.net/11356/1495
Related Identifier	http://candas.ijs.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1495
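
The Metadata Access URL above is an OAI-PMH GetRecord request with the oai_dc (Dublin Core) metadata prefix. A minimal Python sketch of how such a request could be assembled and its response read is given below; the helper names build_record_url and dublin_core_fields are hypothetical, and the parser simply collects every Dublin Core element it finds rather than validating the full OAI-PMH envelope.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
# Standard Dublin Core element namespace used by the oai_dc format.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_record_url(handle_suffix="11356/1495"):
    """Build the GetRecord URL for a CLARIN.SI handle suffix."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": "oai_dc",
        "identifier": f"oai:www.clarin.si:{handle_suffix}",
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown above.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


def dublin_core_fields(xml_text):
    """Collect Dublin Core elements from an XML response into a dict of lists."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        if element.tag.startswith(DC_NS) and element.text:
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields
```

The response body could be fetched with urllib.request.urlopen(build_record_url()).read() and passed to dublin_core_fields; the network call is left out of the sketch so that it runs offline.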

Provenance
Creator	Koloski, Boshko; Martinc, Matej; Tavchioski, Ilija; Škrlj, Blaž; Pollak, Senja
Publisher	Jožef Stefan Institute
Publication Year	2022
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/825153
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/gzip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline	Linguistics