Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0

Dataset

GaMS-Instruct-GEN is an instruction-following dataset for fine-tuning Slovene large language models. It consists of prompt-response pairs, some of which include an additional input field.
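The record structure described above (a prompt, a response, and an optional input) can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names "instruction", "input" and "output" are hypothetical (they follow common instruction-tuning conventions and are not confirmed by this record; see 00README.txt for the actual layout), one JSON object per line is assumed, and the two sample lines are invented for the example rather than taken from the dataset.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Example:
    prompt: str
    response: str
    input: Optional[str] = None  # only some pairs carry an input field


def parse_jsonl(lines: Iterable[str]) -> Iterator[Example]:
    """Parse JSON Lines records into Example objects.

    ASSUMPTION: the keys "instruction", "input" and "output" are
    hypothetical; adjust them to the names used in the real files.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield Example(
            prompt=record["instruction"],
            response=record["output"],
            # A missing or empty input is normalised to None.
            input=record.get("input") or None,
        )


def to_training_text(example: Example) -> str:
    """Join prompt, optional input and response with blank lines."""
    parts = [example.prompt]
    if example.input:
        parts.append(example.input)
    parts.append(example.response)
    return "\n\n".join(parts)


# Invented sample records, one with and one without an input field.
sample_lines = [
    '{"instruction": "Popravi slovnične napake v besedilu.", '
    '"input": "Jaz grem v šola.", "output": "Jaz grem v šolo."}',
    '{"instruction": "Predlagaj enodnevni izlet po Sloveniji.", '
    '"output": "Obiščite Bled in Bohinj."}',
]
examples = list(parse_jsonl(sample_lines))
```

To read a real file, the same function can be given an open file handle, for example `parse_jsonl(open(path, encoding="utf-8"))`, where `path` points to one of the extracted JSONL files. How prompt, input and response are joined for fine-tuning is a design choice; the blank-line separator used here is just one option.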

The dataset was generated automatically with GPT-4 from 225 manually compiled seed prompts taken from SelfInstruct (Wang et al. 2022), an instruction-following dataset for English (https://huggingface.co/datasets/yizhongw/self_instruct). The seed prompts were manually translated into Slovene (see "seed_tasks_sl.jsonl") and embedded in a generation prompt to produce additional, similar examples (see 00README.txt for more details).

The automatically generated examples were manually validated by 9 annotators (linguists). Version 1.0 contains only prompt-response pairs that are adequately formatted and free of LLM hallucinations. Most of the prompt-response pairs deal with general topics (e.g. essay writing, event organization, text correction, creative tasks), while some deal with Slovene-specific topics (e.g. planning trips around Slovenia, or prompts referring to Slovene literature or culture).

Identifier
PID	http://hdl.handle.net/11356/1971
Related Identifier	https://www.cjvt.si/povejmo/en/project/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1971

Provenance
Creator	Vreš, Domen; Arčon, Tjaša; Čibej, Jaka; Robnik-Šikonja, Marko; Krek, Simon; Gabrovšek, Dejan; Ježovnik, Janoš; Kastelic, Maja; Krvina, Domen; Ledinek, Nina; Michelizza, Mija; Perdih, Andrej; Petric Žižić, Špela; Trojar, Mitja
Publisher	Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2024
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); PUB; https://creativecommons.org/licenses/by/4.0/
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics