Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0

GaMS-Instruct-GEN is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions. It consists of pairs of prompts and responses, some of which contain an additional input field.
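The record layout described above can be illustrated with a minimal sketch. It assumes each example is stored as one JSON object per line and that the fields are named "instruction", "input" and "output". These names are hypothetical, borrowed from common instruction-tuning conventions, and the records below are made up for illustration; the actual schema should be checked against 00README.txt in the downloaded archive.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def build_prompt(record):
    """Join the instruction with the optional input field.

    The keys "instruction" and "input" are assumed names, not confirmed
    by the dataset documentation.
    """
    instruction = record["instruction"].strip()
    extra = (record.get("input") or "").strip()
    if extra:
        return f"{instruction}\n\n{extra}"
    return instruction


# Two made-up records: one with an input field, one without.
sample = "\n".join([
    json.dumps({"instruction": "Prevedi v angleščino.",
                "input": "Dober dan!",
                "output": "Good day!"}, ensure_ascii=False),
    json.dumps({"instruction": "Napiši kratek pozdrav.",
                "output": "Pozdravljeni!"}, ensure_ascii=False),
])

records = parse_jsonl(sample)
prompts = [build_prompt(r) for r in records]
```

Treating a missing or empty input field the same way keeps the two kinds of pairs on a single code path, which tends to simplify later fine-tuning templates.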

The dataset was generated automatically with GPT-4 from 225 manually compiled seed prompts taken from SelfInstruct (Wang et al. 2022), an instruction-following dataset for English (https://huggingface.co/datasets/yizhongw/self_instruct). The seed prompts were manually translated into Slovene (see "seed_tasks_sl.jsonl") and used as part of a prompt to generate additional, similar examples (see 00README.txt for more details).

The automatically generated examples were manually validated by nine annotators (linguists). Version 1.0 contains only prompt-response pairs that are adequately formatted and free of LLM hallucinations. Most of the pairs deal with general topics (e.g. essay writing, event organization, text correction, creative tasks), while some deal with Slovene-specific topics (e.g. planning trips around Slovenia, prompts referring to Slovene literature or culture).

Identifier
PID http://hdl.handle.net/11356/1971
Related Identifier https://www.cjvt.si/povejmo/en/project/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1971
Provenance
Creator Vreš, Domen; Arčon, Tjaša; Čibej, Jaka; Robnik-Šikonja, Marko; Krek, Simon; Gabrovšek, Dejan; Ježovnik, Janoš; Kastelic, Maja; Krvina, Domen; Ledinek, Nina; Michelizza, Mija; Perdih, Andrej; Petric Žižić, Špela; Trojar, Mitja
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2024
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); PUB; https://creativecommons.org/licenses/by/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics