Data for: NILK, entity linking dataset targeting NIL-linking cases


A dataset for the NIL-detection and NIL-disambiguation tasks.

The NILK dataset has two main features: 1) It marks NIL-mentions for NIL-detection by extracting, from Wikipedia text, mentions that refer to newly added entities. 2) It provides an entity label for NIL-disambiguation by marking NIL-mentions with WikiData IDs from the newer of the two WikiData dumps compared (2021).

Dataset files contain JSON objects of the following structure:

{"mention":"Walter Damrosch", "offset":348, "length":15, "context":"...the conductor Walter Damrosch. He scored the piece for the standard instruments of the symphony orchestra plus celesta, saxophone, and automobile horns...", "wikipedia_page_id":"309", "wikidata_id":"Q725579", "nil":false}
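
As a minimal sketch, a record of this shape can be read with Python's standard json module. The Mention class and the parse_mention helper are hypothetical names introduced here for illustration and are not part of the dataset's tooling. The context string is shortened, and the sketch makes no assumption about what "offset" is measured against, since the excerpt above is truncated.

```python
import json
from dataclasses import dataclass


@dataclass
class Mention:
    """One dataset record; field names follow the JSON example above."""
    mention: str
    offset: int
    length: int
    context: str
    wikipedia_page_id: str
    wikidata_id: str
    nil: bool


def parse_mention(line: str) -> Mention:
    """Parse one JSON object (given as a string) into a Mention.

    Raises TypeError if the object has missing or unexpected keys.
    """
    return Mention(**json.loads(line))


# Sample record from the description, with the context shortened.
sample = (
    '{"mention":"Walter Damrosch", "offset":348, "length":15, '
    '"context":"...the conductor Walter Damrosch. He scored the piece...", '
    '"wikipedia_page_id":"309", "wikidata_id":"Q725579", "nil":false}'
)

record = parse_mention(sample)
print(record.mention, record.wikidata_id, record.nil)
```

In this example, "length" equals the number of characters in "mention" (15 for "Walter Damrosch"), and both "wikipedia_page_id" and "wikidata_id" are strings rather than numbers.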

The dataset contains both linked and unlinked mentions; the "nil" flag distinguishes between them. To obtain NIL-mentions, we compared two WikiData dumps, one from 2017 and one from 2021. NIL-mentions carry a WikiData ID from the 2021 dump, which can be used to check whether two such mentions refer to the same entity.
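
The use of the "nil" flag and the WikiData ID described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: group_nil_mentions is a hypothetical helper, each input string is assumed to hold exactly one JSON object, and the example records and their placeholder IDs are made up (only the "Walter Damrosch" entry comes from the description).

```python
import json
from collections import defaultdict


def group_nil_mentions(lines):
    """Group the surface forms of NIL-mentions by their WikiData ID.

    `lines` is any iterable of strings, each assumed to hold one JSON object.
    Mentions whose "nil" flag is false are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["nil"]:
            groups[record["wikidata_id"]].append(record["mention"])
    return dict(groups)


# Made-up records for illustration only; the Q_PLACEHOLDER IDs are not real data.
example_lines = [
    '{"mention": "A. Example", "wikidata_id": "Q_PLACEHOLDER_1", "nil": true}',
    '{"mention": "Alice Example", "wikidata_id": "Q_PLACEHOLDER_1", "nil": true}',
    '{"mention": "Walter Damrosch", "wikidata_id": "Q725579", "nil": false}',
    '{"mention": "Another Thing", "wikidata_id": "Q_PLACEHOLDER_2", "nil": true}',
]

nil_groups = group_nil_mentions(example_lines)
for entity_id, mentions in nil_groups.items():
    print(entity_id, mentions)
```

NIL-mentions that share an ID end up in the same group, which corresponds to the check of whether they refer to the same entity. If a dataset file stores one JSON object per line (an assumption about the file layout), an open file handle can be passed directly as `lines`, so the file is processed as a stream rather than loaded whole.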

Creator Iurshina, Anastasiia; Pan, Jiaxin; Boutalbi, Rafika; Staab, Steffen
Publisher DaRUS
Contributor Iurshina, Anastasiia
Publication Year 2023
Funding Reference DFG
Rights CC BY 4.0; info:eu-repo/semantics/openAccess
OpenAccess true
Contact Iurshina, Anastasiia (Universität Stuttgart)
Resource Type Dataset
Format application/octet-stream
Size 11660991447; 9446345420; 9308705399; 9117971657; 8924690144; 8826492642; 8493739663; 8504772937; 8479750863; 8481819381; 8530556439; 6775518920; 11758036808
Version 1.0
Discipline Other