The NL2SH (Natural Language to Semantic Hypergraph) dataset can be used to build and evaluate methods for knowledge extraction and representation based on a semantic hypergraph. Each sentence has natural language annotations and a dedicated semantic hyperedge. The majority of the sentences in this dataset are taken from the following sources:
* John Eastwood, Oxford Guide to English Grammar, Oxford University Press, 2002.
* Andrew Radford, An Introduction to English Sentence Structure, Cambridge University Press, 2009.
* Philip Gucker, Essential English Grammar, Dover Publications, Inc., New York, 1966.
Natural language annotations are:
* sent_i - id of the sentence
* tok_i - id of the token in the sentence
* word - token text
* space - whether a space follows the token
* lemma - lemma of the token
* pos - Universal POS tags (https://universaldependencies.org/u/pos/)
* tag - Penn Treebank tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
* dep - ClearNLP dependency labels (https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md)
* head - id of the token which is a dependency head of the current token
* ner - named entities (https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf)
* roleset - roleset of a verb frame (https://propbank.github.io/v3.4.0/frames/)
* srl - semantic role labels with IOB annotation (https://verbs.colorado.edu/propbank/EPB-Annotation-Guidelines.pdf)
* coref - coreference labels with IOB annotation
* synset - WordNet's synsets (https://wordnet.princeton.edu)
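
As an illustration of how the token-level fields above fit together, the following is a minimal sketch in Python. It is not part of the dataset tooling: the `Token` class, the helper names `detokenize` and `iob_spans`, and the example sentence are all hypothetical. It also assumes that each token carries a single IOB label of the form `B-NAME`, `I-NAME` or `O`; the actual file layout and label encoding of the dataset may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """Hypothetical in-memory view of one annotated token (subset of fields)."""
    sent_i: int      # id of the sentence
    tok_i: int       # id of the token in the sentence
    word: str        # token text
    space: bool      # whether a space follows the token
    srl: str = "O"   # assumed: one IOB label per token, e.g. "B-ARG0"


def detokenize(tokens: List[Token]) -> str:
    """Rebuild the sentence text from the `word` and `space` fields."""
    return "".join(t.word + (" " if t.space else "") for t in tokens)


def iob_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB labels into (name, start, end) spans, with `end` exclusive."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == current
        if not continues:
            # Close the span that was open, if any.
            if current is not None:
                spans.append((current, start, i))
            # Open a new span on B-, or on a stray I- with no matching B-.
            if prefix in ("B", "I"):
                start, current = i, name
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


# Made-up example sentence, not taken from the dataset.
example = [
    Token(0, 0, "The", True, "B-ARG0"),
    Token(0, 1, "cat", True, "I-ARG0"),
    Token(0, 2, "sleeps", False, "B-V"),
    Token(0, 3, ".", False, "O"),
]

text = detokenize(example)
spans = iob_spans([t.srl for t in example])
print(text)
print(spans)
```

The same `iob_spans` helper could be applied to the `coref` column under the same labelling assumption.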
The annotations for semantic hypergraph elements primarily adhere to the annotation guidelines of the Graphbrain project (https://graphbrain.net/manual/notation.html). However, the atom annotations are modified, and each atom ultimately contains:
* label,
* type and optional subtype,
* type-specific atom roles,
* type-specific additional information,
* named entity.