Slovene Translation of the Atomic 2020 data set SloATOMIC 2020

Dataset

The SloATOMIC 2020 corpus contains the Slovene translation of the ATOMIC 2020 data set, a commonsense knowledge graph with 1.33M everyday inferential knowledge tuples about entities and events. The translations were produced with the DeepL translation service; a selection of about 10k test-set examples was also manually inspected and corrected. The corpus consists of 1,331,114 examples distributed across the training, validation, and test sets. The corpus was created as part of work package 4 of the Slovene in the Digital Environment project.

The corpus consists of the following files:

- sloatomic_train.tsv: The training set.
- sloatomic_dev.tsv: The validation set.
- sloatomic_test.tsv.automatic_all: The test set containing all of the automatically translated examples.
- sloatomic_test.tsv.automatic_10k: A selection of 10k automatically translated examples from the complete test set.
- sloatomic_test.tsv.manual_10k: The same 10k subset after manual inspection and correction.

The data is in TSV (tab-separated values) format. Each line contains one example. The columns are:

- head_event: The head event of the example.
- relation: The relation between the head event and the tail event. It is one of 23 relation types.
- tail_event: The tail event of the example.
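
The three-column layout above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: the names Triple, read_triples and relation_counts are hypothetical, and it assumes that the fields contain no embedded tabs and that a header row, if the files have one, repeats the column names listed above.

```python
import csv
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Triple(NamedTuple):
    """One example: head event, relation, tail event."""
    head_event: str
    relation: str
    tail_event: str


def read_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield one Triple per tab-separated line.

    `lines` can be an open text file or any iterable of strings.
    """
    # QUOTE_NONE treats quotation marks as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        if row[0] == "head_event":
            continue  # assumption: a header row, if any, uses the column names
        yield Triple(*row)


def relation_counts(triples: Iterable[Triple]) -> Counter:
    """Count how many examples use each relation descriptor."""
    return Counter(t.relation for t in triples)
```

Passing a file handle opened on sloatomic_train.tsv with encoding="utf-8" to read_triples, and the result to relation_counts, should give the number of examples per relation; if the description above holds, the counter would contain at most 23 distinct keys.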

Identifier
PID	http://hdl.handle.net/11356/1724
Related Identifier	https://ailab.ijs.si/dunja/SiKDD2022/Papers/SiKDD2022_paper_5674.pdf
Related Identifier	https://github.com/E3-JSI/dataset-SloATOMIC-2020
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1724

Provenance
Creator	Mladenić Grobelnik, Adrian; Novak, Erik; Mladenić, Dunja; Grobelnik, Marko
Publisher	Jožef Stefan Institute
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB; https://creativecommons.org/licenses/by-sa/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics