Slovenian keyword extraction dataset from SentiNews 1.0

The dataset consists of 7514 Slovenian news articles from the SentiNews 1.0 corpus by Bučar et al. 2017 (http://hdl.handle.net/11356/1110) for which article keywords were available. We provide train and test splits (5995 articles for training and 1519 for testing) that can be used for keyword extraction experiments. The data are provided in JSON format, with the following fields: title, keywords, lang (always Slovene) and body (the content of the article).

In our paper we addressed keyword extraction in a cross-lingual setting: Koloski, Boshko, et al. "Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?" arXiv preprint arXiv:2202.06650 (2022). [https://arxiv.org/pdf/2202.06650.pdf]

To reproduce the results, you can also use the keyword datasets available at http://hdl.handle.net/11356/1403, described in: Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.
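The JSON fields listed in the description (title, keywords, lang, body) can be read with a short loader. The following is a minimal Python sketch, not the authors' code. It makes several assumptions that the record does not confirm: that a split is either a single JSON array or one JSON object per line, that a file ending in .gz is gzip-compressed text, and that the function names parse_records and load_split are purely illustrative. The type of the keywords field (list or delimited string) is not specified, so the sketch leaves it untouched.

```python
import gzip
import json
from pathlib import Path

# Field names taken from the dataset description.
REQUIRED_FIELDS = ("title", "keywords", "lang", "body")


def parse_records(text):
    """Parse JSON text into a list of article records.

    Assumption: the text is either one JSON array of objects
    or JSON Lines (one object per line). Both are accepted.
    """
    text = text.strip()
    if not text:
        return []
    try:
        data = json.loads(text)
        records = data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to one JSON object per non-empty line.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Check that every record carries the documented fields.
    for index, record in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
    return records


def load_split(path):
    """Read one split from disk; .gz files are decompressed (assumption)."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return parse_records(handle.read())
```

Calling load_split with the path of a downloaded split (the actual file names are not given in this record) returns a list of dictionaries. If the files follow the description, the two splits should contain 5995 and 1519 records.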

Identifier
PID http://hdl.handle.net/11356/1495
Related Identifier http://candas.ijs.si/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1495
Provenance
Creator Koloski, Boshko; Martinc, Matej; Tavchioski, Ilija; Škrlj, Blaž; Pollak, Senja
Publisher Jožef Stefan Institute
Publication Year 2022
Funding Reference info:eu-repo/grantAgreement/EC/H2020/825153
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/gzip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics