Natural Language 2 Semantic Hypergraph Dataset NL2SH 1.0

Dataset

The NL2SH (Natural Language to Semantic Hypergraph) dataset can be used to build and evaluate methods for knowledge extraction and representation based on a semantic hypergraph. Each sentence has natural language annotations and a dedicated semantic hyperedge. The majority of the sentences in this dataset are taken from the following sources:
* John Eastwood, Oxford Guide to English Grammar, Oxford University Press, 2002.
* Andrew Radford, An Introduction to English Sentence Structure, Cambridge University Press, 2009.
* Philip Gucker, Essential English Grammar, Dover Publications, Inc., New York, 1966.

The natural language annotations are:
* sent_i - id of the sentence
* tok_i - id of the token in the sentence
* word - token text
* space - whether a space follows the token
* lemma - lemma of the token
* pos - Universal POS tags (https://universaldependencies.org/u/pos/)
* tag - Penn Treebank tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
* dep - ClearNLP dependency labels (https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md)
* head - id of the token that is the dependency head of the current token
* ner - named entities (https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf)
* roleset - roleset of a verb frame (https://propbank.github.io/v3.4.0/frames/)
* srl - semantic role labels with IOB annotation (https://verbs.colorado.edu/propbank/EPB-Annotation-Guidelines.pdf)
* coref - coreference labels with IOB annotation
* synset - WordNet synsets (https://wordnet.princeton.edu)
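
As an illustration of how the token-level annotations fit together, the following minimal Python sketch holds a subset of the fields listed above (sent_i, tok_i, word, space, lemma, pos) in a small record type and uses word and space to rebuild the sentence text. The on-disk layout of the dataset file is not described here, so the `Token` class, the `detokenize` helper, and the sample values are hypothetical and only show one possible in-memory representation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Token:
    """Hypothetical in-memory record for a subset of the NL2SH token annotations."""
    sent_i: int   # id of the sentence
    tok_i: int    # id of the token in the sentence
    word: str     # token text
    space: bool   # whether a space follows the token
    lemma: str    # lemma of the token
    pos: str      # Universal POS tag


def detokenize(tokens: Iterable[Token]) -> str:
    """Rebuild the sentence text from the word and space annotations."""
    ordered = sorted(tokens, key=lambda t: t.tok_i)
    return "".join(t.word + (" " if t.space else "") for t in ordered)


# Made-up sample values, not taken from the dataset.
sample = [
    Token(0, 0, "Birds", True, "bird", "NOUN"),
    Token(0, 1, "sing", False, "sing", "VERB"),
    Token(0, 2, ".", False, ".", "PUNCT"),
]

print(detokenize(sample))
```

Sorting by tok_i before joining keeps the helper independent of the order in which the rows were read.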

The annotations for semantic hypergraph elements primarily adhere to the annotation guidelines of the Graphbrain project (https://graphbrain.net/manual/notation.html). However, atom annotations are modified and, at the end, contain:
* label
* type and optional subtype
* type-specific atom roles
* type-specific additional information
* named entity

Identifier
PID	http://hdl.handle.net/11356/1822
Related Identifier	https://www.acnltutor.net
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1822

Provenance
Creator	Žitko, Branko; Gašpar, Angelina; Bročić, Lucija; Vasić, Daniel
Publisher	Faculty of Science, University of Split
Publication Year	2023
Rights	CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0; https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0; ACA
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	English
Resource Type	corpus
Format	text/plain; charset=utf-8; text/plain; downloadable_files_count: 1
Discipline	Linguistics