Data for: NILK, entity linking dataset targeting NIL-linking cases

A dataset for the NIL-detection and NIL-disambiguation tasks.

The NILK dataset has two main features: 1) It marks NIL-mentions for NIL-detection by extracting mentions which belong to newly added entities in Wikipedia text. 2) It provides an entity label for NIL-disambiguation by marking NIL-mentions with WikiData IDs from the newer dump.

Dataset files contain JSON objects of the following structure:

{"mention":"Walter Damrosch", "offset":348, "length":15, "context":"...the conductor Walter Damrosch. He scored the piece for the standard instruments of the symphony orchestra plus celesta, saxophone, and automobile horns...", "wikipedia_page_id":"309", "wikidata_id":"Q725579", "nil":false}
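
A minimal sketch of reading one such record in Python. It assumes each JSON object sits on a single line, which the description above does not state explicitly. The names `Mention` and `parse_mention` are illustrative and not part of the dataset, and the context string in the sample is shortened.

```python
import json
from dataclasses import dataclass


@dataclass
class Mention:
    """One record, with the fields shown in the example object above."""
    mention: str
    offset: int
    length: int
    context: str
    wikipedia_page_id: str
    wikidata_id: str
    nil: bool


def parse_mention(line: str) -> Mention:
    """Parse one JSON object (assumed to sit on a single line) into a Mention."""
    obj = json.loads(line)
    return Mention(
        mention=obj["mention"],
        offset=obj["offset"],
        length=obj["length"],
        context=obj["context"],
        wikipedia_page_id=obj["wikipedia_page_id"],
        wikidata_id=obj["wikidata_id"],
        nil=obj["nil"],
    )


# The example record from above, with the context shortened for readability.
sample = (
    '{"mention":"Walter Damrosch", "offset":348, "length":15, '
    '"context":"...the conductor Walter Damrosch. He scored the piece...", '
    '"wikipedia_page_id":"309", "wikidata_id":"Q725579", "nil":false}'
)

record = parse_mention(sample)
print(record.mention, record.wikidata_id, record.nil)
```

In this record the "length" value matches the number of characters in the "mention" string.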

The dataset contains both linked and unlinked mentions, which can be distinguished by checking the "nil" flag. To obtain NIL-mentions, we compared two WikiData dumps, one from 2017 and one from 2021. Each NIL-mention carries its WikiData ID from the 2021 dump, which can be used to check whether two NIL-mentions refer to the same entity.
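
The following sketch shows one way to separate the two kinds of mentions and to group NIL-mentions by their WikiData ID. It again assumes one JSON object per line. The helper names `split_by_nil` and `group_by_entity` are hypothetical, and the two NIL records in `toy_lines` are invented placeholders, not real dataset entries.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_by_nil(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Return (linked, nil_mentions), split on the "nil" flag of each record."""
    linked: List[dict] = []
    nil_mentions: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record["nil"]:
            nil_mentions.append(record)
        else:
            linked.append(record)
    return linked, nil_mentions


def group_by_entity(nil_mentions: Iterable[dict]) -> Dict[str, List[str]]:
    """Group mention strings by WikiData ID; a shared ID means the same entity."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in nil_mentions:
        groups[record["wikidata_id"]].append(record["mention"])
    return dict(groups)


# Toy input: the first two records are made-up placeholders for illustration.
toy_lines = [
    '{"mention": "A. Example", "wikidata_id": "Q_HYPOTHETICAL_1", "nil": true}',
    '{"mention": "Alex Example", "wikidata_id": "Q_HYPOTHETICAL_1", "nil": true}',
    '{"mention": "Walter Damrosch", "wikidata_id": "Q725579", "nil": false}',
]

linked, nil_mentions = split_by_nil(toy_lines)
groups = group_by_entity(nil_mentions)
print(len(linked), len(nil_mentions), groups)
```

For a real file, an open file handle can be passed in place of `toy_lines`, since iterating over it yields one line at a time and avoids loading a multi-gigabyte file into memory.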

Identifier
DOI https://doi.org/10.18419/darus-3454
Related Identifier IsCitedBy https://doi.org/10.1145/3511808.3557659
Metadata Access https://darus.uni-stuttgart.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18419/darus-3454
Provenance
Creator Iurshina, Anastasiia; Pan, Jiaxin; Boutalbi, Rafika; Staab, Steffen
Publisher DaRUS
Contributor Iurshina, Anastasiia
Publication Year 2023
Funding Reference DFG
Rights CC BY 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess true
Contact Iurshina, Anastasiia (Universität Stuttgart)
Representation
Resource Type Dataset
Format application/octet-stream
Size 11660991447; 9446345420; 9308705399; 9117971657; 8924690144; 8826492642; 8493739663; 8504772937; 8479750863; 8481819381; 8530556439; 6775518920; 11758036808
Version 1.0
Discipline Other