Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian

We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech tagging, and morphological annotation. One (“RNC”) was developed for annotating texts in the Russian National Corpus and is rule-based. The other (“TOROT”) is used to annotate the eponymous corpus and is statistical. We apply the two analyzers to the same Middle Russian text and then compare their outputs with high-quality manual annotation. Since the analyzers use different annotation schemes and spelling principles, we have to harmonize their outputs before we can compare them. The comparison shows that TOROT performs considerably better than RNC (lemmatization 69.8% vs. 47.3%, part of speech 89.5% vs. 54.2%, morphology 81.5% vs. 16.7%). If, however, we limit the evaluation set to those tokens for which the analyzers provide a guess and, in addition, consider the RNC response correct if one of the multiple guesses it provides is correct, the numbers become comparable (88.5% vs. 91.9%, 93.9% vs. 95.2%, 81.5% vs. 86.8%). We develop a simple procedure which boosts TOROT lemmatization accuracy by 8.7% by using RNC lemma guesses when TOROT fails to provide one and matching them against the existing TOROT lemma database. We conclude that a statistical analyzer (trained on a large amount of material) can deal with non-standardised historical texts better than a rule-based one. Still, it is possible to make the analyzers collaborate, boosting the performance of the superior one.
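
The fallback procedure and the two evaluation settings described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `combine_lemma` and `accuracy`, the data layout (one list of guesses per token, empty when the analyzer gives no guess), and the use of exact string matching against the lemma database are all assumptions made for illustration. The real procedure operates on harmonized outputs and may match lemmas differently.

```python
from typing import Optional, Sequence, Set


def combine_lemma(torot_lemma: Optional[str],
                  rnc_guesses: Sequence[str],
                  torot_lemmas: Set[str]) -> Optional[str]:
    """Keep TOROT's lemma when it has one; otherwise fall back to RNC.

    Assumption: an RNC guess is accepted only if it matches an entry in
    the TOROT lemma database exactly (hypothetical matching rule).
    """
    if torot_lemma is not None:
        return torot_lemma
    for guess in rnc_guesses:
        if guess in torot_lemmas:
            return guess
    return None  # neither analyzer yields a usable lemma


def accuracy(guesses: Sequence[Sequence[str]],
             gold: Sequence[str],
             answered_only: bool = False) -> float:
    """Share of tokens whose gold label is among the analyzer's guesses.

    With answered_only=True, tokens without any guess are excluded,
    which mirrors the restricted evaluation setting.
    """
    pairs = [(g, y) for g, y in zip(guesses, gold) if g or not answered_only]
    if not pairs:
        return 0.0
    return sum(y in g for g, y in pairs) / len(pairs)
```

With `answered_only=False`, a token without a guess counts as an error, as in the first set of figures. With `answered_only=True`, such tokens are dropped and a multi-guess response counts as correct if any guess matches, as in the second set.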

Identifier
DOI https://doi.org/10.18710/T9NQ9L
Metadata Access https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/T9NQ9L
Provenance
Creator Berdicevskis, Aleksandrs; Eckhoff, Hanne; Gavrilova, Tatjana
Publisher DataverseNO
Contributor Berdicevskis, Aleksandrs; UiT The Arctic University of Norway; National Research University Higher School of Economics; The Tromsø Repository of Language and Linguistics (TROLLing)
Publication Year 2017
Rights CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess true
Contact Berdicevskis, Aleksandrs
Representation
Resource Type Dataset
Format text/plain; application/vnd.lotus-1-2-3; application/octet-stream; text/tab-separated-values; text/csv; text/xml
Size 3460; 18432; 18125; 3612345; 1465051; 1106315; 226550; 452814; 357324; 397238; 23785; 119089; 167037; 597639; 6871; 47913; 233041; 1111; 10772
Version 1.2
Discipline Humanities