Croatian SenseGraph 1.0

Dataset

SenseGraph is a graph-like structure of word senses for the most common words of the standard Croatian language, obtained by relying on human-provided lexical substitutes for target words in context. SenseGraph is encoded in the Lexical Markup Framework (LMF; ISO 24613:2008) format.

SenseGraph consists of SenseCells, which are clusters of same-sense words obtained by grouping words based on the similarity of their lexical substitution sets and the contexts they appear in. SenseCells can be thought of as synsets in standard computational lexicographic terminology, although they exhibit more variability, which can be attributed to sense modulations in specific contexts. SenseCells are linked to each other based on loose semantic relatedness.
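The idea of grouping by similarity of substitution sets can be illustrated with a minimal sketch. This is not the authors' clustering method: it assumes Jaccard overlap as the similarity measure, a greedy single-link grouping, and a hypothetical threshold of 0.5, and it ignores the context similarity mentioned above. The function names and the toy identifiers are invented for illustration only.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two substitute sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_substitutes(
    substitutes: Dict[str, Set[str]], threshold: float = 0.5
) -> List[List[str]]:
    """Greedy single-link grouping (an assumed stand-in, not the dataset's method).

    Each item joins the first existing group that contains at least one
    member whose substitute set is similar enough; otherwise it starts
    a new group.
    """
    groups: List[List[str]] = []
    for item_id, subs in substitutes.items():
        for group in groups:
            if any(jaccard(subs, substitutes[m]) >= threshold for m in group):
                group.append(item_id)
                break
        else:
            groups.append([item_id])
    return groups


# Hypothetical toy data: instance id -> set of human-provided substitutes.
toy = {
    "sent1": {"a", "b", "c"},
    "sent2": {"a", "b"},
    "sent3": {"x", "y"},
    "sent4": {"x", "y", "z"},
}

groups = group_by_substitutes(toy)
print(groups)
```

With the toy data, instances that share most of their substitutes end up in the same group, which is the intuition behind a SenseCell.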

In total, the resource covers 649 Croatian words across three parts of speech: nouns, verbs, and adjectives. More specifically, it contains 4,172 sentences for 230 nouns, 3,288 sentences for 200 verbs, and 4,116 sentences for 219 adjectives. The sentences were sampled from the SETimes.HR and hrWaC corpora and then clustered using a lexical-substitution-based clustering method, yielding 2,877 SenseCells.

Total number of sentences: 11,576
Total number of SenseCells: 2,877
Total number of words: 649
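As a quick consistency check, the per-part-of-speech figures quoted above can be summed and compared with the reported totals. The dictionary layout and variable names below are illustrative assumptions; only the numbers come from the description.

```python
# Per-part-of-speech counts as stated in the description.
counts = {
    "noun": {"words": 230, "sentences": 4172},
    "verb": {"words": 200, "sentences": 3288},
    "adjective": {"words": 219, "sentences": 4116},
}

total_words = sum(c["words"] for c in counts.values())
total_sentences = sum(c["sentences"] for c in counts.values())

# Both sums should match the totals reported for the resource.
print(total_words, total_sentences)

# Average number of sampled sentences per target word.
print(round(total_sentences / total_words, 2))
```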

Identifier
PID	http://hdl.handle.net/11356/1218
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1218

Provenance
Creator	Šnajder, Jan; Alagić, Domagoj
Publisher	Faculty of Electrical Engineering and Computing, University of Zagreb
Publication Year	2018
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Croatian
Resource Type	lexicalConceptualResource
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics