A dataset for the NIL-detection and NIL-disambiguation tasks.
The NILK dataset has two main features: 1) It marks NIL-mentions for NIL-detection by extracting, from Wikipedia text, the mentions that refer to newly added entities. 2) It provides an entity label for NIL-disambiguation by annotating each NIL-mention with its WikiData ID from the newer of two WikiData dumps (2021).
Dataset files contain JSON objects of the following structure:
{"mention":"Walter Damrosch",
"offset":348,
"length":15,
"context":"...the conductor Walter Damrosch. He scored the piece for the standard instruments of the symphony orchestra plus celesta, saxophone, and automobile horns...",
"wikipedia_page_id":"309",
"wikidata_id":"Q725579",
"nil":false}
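As a minimal sketch, a record of this shape can be loaded into a typed structure in Python. The `Mention` class and the `parse_mention` helper are hypothetical names, not part of the dataset, and the field types are inferred from the example above. The context string is abbreviated here for brevity.

```python
import json
from dataclasses import dataclass


@dataclass
class Mention:
    # Field names follow the example record; types are inferred from it.
    mention: str
    offset: int
    length: int
    context: str
    wikipedia_page_id: str
    wikidata_id: str
    nil: bool


def parse_mention(raw: str) -> Mention:
    """Parse one JSON object into a Mention (hypothetical helper).

    Raises TypeError if the object has missing or unexpected keys.
    """
    return Mention(**json.loads(raw))


# The example record from above, with the context abbreviated.
raw = (
    '{"mention":"Walter Damrosch","offset":348,"length":15,'
    '"context":"...the conductor Walter Damrosch. ...",'
    '"wikipedia_page_id":"309","wikidata_id":"Q725579","nil":false}'
)
record = parse_mention(raw)
print(record.mention, record.wikidata_id, record.nil)
```

In this example record, "length" equals the number of characters in "mention"; the sketch does not assume anything about what "offset" is measured against.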
The dataset contains both linked and unlinked (NIL) mentions; they can be distinguished by checking the "nil" flag. To obtain NIL-mentions, we compared two WikiData dumps, from 2017 and 2021. Each NIL-mention carries its WikiData ID from the 2021 dump, which can be used to check whether two NIL-mentions refer to the same entity.
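The following Python sketch illustrates one way to separate linked mentions from NIL-mentions and to group NIL-mentions by entity. It assumes one JSON object per line, which is an assumption about the file layout rather than something stated here. The function names are hypothetical, and the two NIL records and their IDs ("Q_EXAMPLE_1", "Q_EXAMPLE_2") are made-up placeholders, not real dataset entries.

```python
import json
from collections import defaultdict


def read_mentions(lines):
    """Parse an iterable of JSON strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def split_by_nil(records):
    """Return (linked, nil) lists based on the "nil" flag."""
    linked = [r for r in records if not r["nil"]]
    nil = [r for r in records if r["nil"]]
    return linked, nil


def group_nil_by_entity(records):
    """Map each WikiData ID to the NIL-mentions annotated with it."""
    groups = defaultdict(list)
    for r in records:
        if r["nil"]:
            groups[r["wikidata_id"]].append(r["mention"])
    return dict(groups)


# Placeholder input: the first record reuses values from the example above;
# the NIL records are invented for illustration only.
sample_lines = [
    '{"mention": "Walter Damrosch", "wikidata_id": "Q725579", "nil": false}',
    '{"mention": "Example Band", "wikidata_id": "Q_EXAMPLE_1", "nil": true}',
    '{"mention": "the Example Band", "wikidata_id": "Q_EXAMPLE_1", "nil": true}',
    '{"mention": "Another Entity", "wikidata_id": "Q_EXAMPLE_2", "nil": true}',
]

records = read_mentions(sample_lines)
linked, nil_mentions = split_by_nil(records)
nil_groups = group_nil_by_entity(records)
print(len(linked), len(nil_mentions), nil_groups)
```

To run this on a real file, the same functions could be applied to an open file handle in place of `sample_lines`, provided the file really does store one object per line.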