Slovenian Definition Extraction training dataset DF_NDF_wiki_slo 1.0

The Slovenian definition extraction training dataset DF_NDF_wiki_slo contains 38613 sentences extracted from the Slovenian Wikipedia. The first sentence of a term's description on Wikipedia is considered a definition, and all other sentences are considered non-definitions.
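The labeling rule can be illustrated with a minimal sketch. It assumes the sentences of one term's description are already split and given in order; the function name `label_sentences` and the 1/0 label encoding are hypothetical and are not part of the dataset.

```python
def label_sentences(sentences):
    """Label the first sentence as a definition (1), all others as non-definitions (0).

    `sentences` is the ordered list of sentences from one term's description.
    The 1/0 encoding is an illustrative assumption, not the dataset's own format.
    """
    return [(sentence, 1 if index == 0 else 0)
            for index, sentence in enumerate(sentences)]
```

The dataset itself stores the two classes in separate files rather than as explicit labels.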

The corpus consists of the following files, each containing one definition or non-definition sentence per line:

Definitions: df_ndf_wiki_slo_Y.txt with 3251 definition sentences.
Non-definitions: df_ndf_wiki_slo_N.txt with 14678 non-definition sentences which do not contain the term at the beginning of the sentence.
Non-definitions: df_ndf_wiki_slo_N1.txt with 20684 non-definition sentences which may also contain the term at the beginning of the sentence.
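
Because each file holds one sentence per line, the three files can be read into a single labeled list. The following is a minimal sketch, not an official loader: the function name `load_sentences`, the 1/0 label encoding, and the choice to give both non-definition files the same label are assumptions made for illustration. Only the file names come from the description above.

```python
from pathlib import Path

# File names as listed in the dataset description.
# The label values are an assumption: 1 = definition, 0 = non-definition.
FILES = {
    "df_ndf_wiki_slo_Y.txt": 1,
    "df_ndf_wiki_slo_N.txt": 0,
    "df_ndf_wiki_slo_N1.txt": 0,
}


def load_sentences(data_dir, files=FILES):
    """Return (sentence, label, source_file) triples, one per non-empty line."""
    examples = []
    for name, label in files.items():
        path = Path(data_dir) / name
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                if sentence:  # skip blank lines, if any
                    examples.append((sentence, label, name))
    return examples
```

Calling `load_sentences("path/to/unpacked/dataset")` (the path is a placeholder) returns the combined list. Keeping the source file name in each triple makes it possible to train on only one of the two non-definition sets if desired.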

The dataset is described in more detail in Fišer et al. (2010). If you use this resource, please cite:

Fišer, D., Pollak, S., Vintar, Š. (2010). Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). https://aclanthology.org/L10-1089/

Reference for training Transformer-based definition extraction models on this dataset: Tran, T.H.H., Podpečan, V., Jemec Tomazin, M., Pollak, S. (2023). Definition Extraction for Slovene: Patterns, Transformer Classifiers and ChatGPT. Proceedings of ELEX 2023: Electronic lexicography in the 21st century. Invisible lexicography: everywhere lexical data is used without users realizing they make use of a “dictionary”.

Related resources: Jemec Tomazin, M. et al. (2023). Slovenian Definition Extraction evaluation datasets RSDO-def 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1841

Identifier
PID	http://hdl.handle.net/11356/1840
Related Identifier	https://aclanthology.org/L10-1089/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1840

Provenance
Creator	Podpečan, Vid; Pollak, Senja; Fišer, Darja; Vintar, Špela; Tran, Thi Hong Hanh
Publisher	Jožef Stefan Institute; Faculty of Arts, University of Ljubljana
Publication Year	2023
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics