Lists of Slovene accentuated units SNES 1.0

Dataset

SNES (Stalno naglašene enote iz Sloleksa; Constantly accentuated units from Sloleks) is a dataset of Slovene final accentuated word parts, i.e., the part of an accentuated word that runs from its last grapheme bearing an accentuation diacritic to the end of the word (for instance, -álnik for "računálnik" and -úlja for "hodúlja"). The parts were automatically extracted from the accentuated forms of the approximately 100,800 manually validated lexemes of Sloleks 3.0 (http://hdl.handle.net/11356/1745). The extracted parts were then manually categorized to compile a manually validated, machine-readable list of final word parts that are always or almost always accentuated in Slovene (e.g., -álnik, -ílnik). Only word parts that are accentuated in at least 80% of examples were included in the manual list. The list can be used as a post-processing resource to correct some of the errors in the output of Slovene accentuation models.
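The definition of a final accentuated word part given above can be illustrated with a minimal Python sketch. It is not the extraction script used to build SNES. The function name final_accentuated_part and the set of diacritics treated as accentuation marks (acute, grave, circumflex) are assumptions made for illustration only; the actual inventory of marks in Sloleks 3.0 should be checked against the data and 00README.txt.

```python
import unicodedata

# Assumption: accentuation is marked with a combining acute, grave, or
# circumflex. Carons (as in č, š, ž) are part of the base letter and are
# deliberately not treated as accentuation marks.
ACCENT_MARKS = {"\u0301", "\u0300", "\u0302"}


def is_accented(ch):
    """Return True if the character carries one of the assumed accent marks."""
    return any(c in ACCENT_MARKS for c in unicodedata.normalize("NFD", ch))


def final_accentuated_part(word):
    """Return the word part from the last accented grapheme to the end.

    Hypothetical helper: the result is prefixed with a hyphen, mirroring the
    notation in the description (e.g. -álnik). Returns None if the word
    carries no accentuation mark.
    """
    word = unicodedata.normalize("NFC", word)
    # Scan from the end of the word towards the beginning.
    for i in range(len(word) - 1, -1, -1):
        if is_accented(word[i]):
            # If the mark has no precomposed form, step back to its base letter.
            while i > 0 and unicodedata.combining(word[i]):
                i -= 1
            return "-" + word[i:]
    return None


for form in ["računálnik", "hodúlja"]:
    print(form, final_accentuated_part(form))
# računálnik -álnik
# hodúlja -úlja
```

Scanning from the end of the word means that only the last accentuated grapheme determines the extracted part, which matches the definition even for forms with more than one diacritic.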

Version 1.0 includes 24,188 automatically extracted final accentuated word parts, 1,013 of which have been manually validated, categorized, and included in a separate manual list of Slovene final word parts that are always or very frequently accentuated. For more details on the structure of the files, please consult 00README.txt.

Identifier
PID	http://hdl.handle.net/11356/2003
Related Identifier	https://doi.org/10.4312/jis.70.3.109-124
Related Identifier	https://mezzanine.um.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/2003

Provenance
Creator	Čibej, Jaka
Publisher	Centre for Language Resources and Technologies, University of Ljubljana; Faculty of Arts, University of Ljubljana; Jožef Stefan Institute
Publication Year	2024
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics