Replication Data for: Predicting Stress in Russian using Modern Machine-Learning Tools

This dataset consists of a TSV file with five columns of data originating in Zaliznyak's Grammar and Dictionary (1977). The data was programmatically scraped from Giella project data (Moshagen et al., 2013) by Spektor (2021), who used it as one of four sources in the RusLex application. After extraction from RusLex, only symbols were removed. The Russian words are preserved in Cyrillic as in the original. The last column contains abbreviated morphological features in English (e.g. "V" for verb, "N" for noun, "Fem" for feminine, "Cmpr" for comparative, "Impf" for imperfective). A word often has many features, which are separated by semicolons.

A stress code representing stress placement was derived for each word by counting syllables from the end of the word. If the stressed vowel was in the final syllable, a stress code of 0 was assigned, signifying oxytone stress. Stress on the penultimate syllable was given a 1 (paroxytone stress), and stress on the antepenultimate syllable was given a 2 (proparoxytone stress). The script continued in this way until a stress code was assigned, with one exception: words without explicit stress markers were assigned a -1.

The columns in the resulting TSV are: the word without stress markers, the word with stress markers, the derived stress code, the lemma, and all morphological features. The dataset contains over 300,000 words from Zaliznyak (1977), including many repeated words that differ in their morphological features. Please see the paper for a full description of the dataset.
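The stress-code scheme described above can be illustrated with a minimal sketch. This is not the script used to build the dataset. It assumes, hypothetically, that stress is marked with a combining acute accent (U+0301) placed directly after the stressed vowel, and that each Russian vowel letter counts as one syllable. The names `STRESS_MARK`, `RUSSIAN_VOWELS`, and `stress_code` are illustrative only.

```python
# Assumption: the stress marker is a combining acute accent that follows
# the stressed vowel. The actual marker in the dataset may differ.
STRESS_MARK = "\u0301"

# Each vowel letter is treated as the nucleus of one syllable.
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def stress_code(stressed_word: str) -> int:
    """Return the number of syllables after the stressed one.

    0 = oxytone, 1 = paroxytone, 2 = proparoxytone, and so on.
    Words with no explicit stress marker get -1.
    """
    mark_index = stressed_word.find(STRESS_MARK)
    if mark_index == -1:
        return -1
    # Every vowel after the marker belongs to a later syllable.
    tail = stressed_word[mark_index + 1:]
    return sum(1 for ch in tail if ch in RUSSIAN_VOWELS)


if __name__ == "__main__":
    for word in ["молоко\u0301", "соба\u0301ка", "я\u0301блоко", "стол"]:
        print(word, stress_code(word))
```

Under these assumptions, a word with final stress yields 0, a word with penultimate stress yields 1, and an unmarked word yields -1. Words with more than one marker would need a rule that the sketch does not attempt to guess.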

References:

Moshagen, Sjur N., Tommi Pirinen, and Trond Trosterud. (2013). Building an open-source development infrastructure for language technology projects. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) (pp. 343–352).

Spektor, Y. (2021). Detection and morphological analysis of novel Russian loanwords (Master’s thesis, CUNY Graduate Center, New York, NY). Retrieved from https://academicworks.cuny.edu/gc_etds/4572/

Zaliznyak, A.A. (1977). Grammatičeskij slovar’ russkogo jazyka. Slovoizmenenie [A grammatical dictionary of Russian: Inflection]. Moscow: Russkij jazyk.

Python, 3.10.6

RusLex, 1.0

Zaliznyak, A.A. (1977). Grammatičeskij slovar’ russkogo jazyka. Slovoizmenenie [A grammatical dictionary of Russian: Inflection]. Moscow: Russkij jazyk

The documentation from RusLex is quite thorough in the application itself; the code is well-commented.

I did not look at the Giella data specifically but rather only used the Spektor data sourced from Giella.

This dataset is the Zaliznyak data from Spektor with only symbols removed. Otherwise, the data was pulled directly and its content was not modified. The data was organized into four columns, plus a fifth column for the stress codes I derived.
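As a rough illustration of the five-column layout, the sketch below reads tab-separated rows into named records using the column order given in the description. The field names, the `Entry` type, and the `read_entries` helper are hypothetical, and the sample row is made up for demonstration. Whether the file has a header row is not stated, so rows whose third column is not an integer are skipped.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    word: str            # word without stress markers
    stressed_word: str   # word with stress markers
    stress_code: int     # derived stress code (-1 if unmarked)
    lemma: str
    features: list[str]  # morphological features, split on ";"


def read_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per well-formed five-column TSV row."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            continue  # skip malformed or empty rows
        word, stressed, code, lemma, feats = row
        try:
            code_int = int(code)
        except ValueError:
            continue  # likely a header row, if one exists
        yield Entry(word, stressed, code_int, lemma,
                    [f for f in feats.split(";") if f])


if __name__ == "__main__":
    # Made-up sample row; the stress marker (U+0301) is an assumption.
    sample = "собака\tсоба\u0301ка\t1\tсобака\tN;Fem\n"
    for entry in read_entries(io.StringIO(sample)):
        print(entry)
```

To read the real file, an open file handle (opened with `encoding="utf-8"` and `newline=""`) could be passed in place of the `io.StringIO` object.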

Identifier
DOI https://doi.org/10.18710/AAFCJP
Related Identifier IsCitedBy https://academicworks.cuny.edu/gc_etds/4974
Metadata Access https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/AAFCJP
Provenance
Creator Schriner, John
Publisher DataverseNO
Contributor Schriner, John; NYU School of Law; Spektor, Yulia; The Tromsø Repository of Language and Linguistics (TROLLing)
Publication Year 2025
Rights CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess true
Contact Schriner, John (NYU School of Law)
Representation
Resource Type textual data; Dataset
Format text/plain; text/tsv
Size 7294; 30201196
Version 1.0
Discipline Humanities