Replication data for: Big data in Russian linguistics? Another look at paucal constructions

This dataset contains a database of Russian numeral constructions extracted from the RuTenTen corpus (https://www.sketchengine.co.uk/rutenten-russian-corpus/). Each construction consists of a paucal numeral (2, 3 or 4) followed by an adjective and a feminine noun.

Abstract:

With the advent of large web-based corpora, Russian linguistics steps into the era of “big data”. But how useful are large datasets in our field? What are the advantages? Which problems arise? The present study seeks to shed light on these questions based on an investigation of the Russian paucal construction in the RuTenTen corpus, a web-based corpus with more than ten billion words. The focus is on the choice between adjectives in the nominative (dve/tri/četyre starye knigi) and genitive (dve/tri/četyre staryx knigi) in paucal constructions with the numerals dve, tri or četyre and a feminine noun. Three generalizations emerge. First, the large RuTenTen dataset enables us to identify predictors that could not be explored in smaller corpora. In particular, it is shown that predicates, modifiers, prepositions and word-order affect the case of the adjective. Second, we identify situations where the RuTenTen data cannot be straightforwardly reconciled with findings from earlier studies or there appear to be discrepancies between different statistical models. In such cases, further research is called for. The effect of the numeral (dve, tri vs. četyre) and verbal government are relevant examples. Third, it is shown that adjectives in the nominative have more easily learnable predictors that cover larger classes of examples and show clearer preferences for the relevant case. It is therefore suggested that nominative adjectives have the potential to outcompete adjectives in the genitive over time. Although these three generalizations are valuable additions to our knowledge of Russian paucal constructions, three problems arise. Large internet-based corpora like the RuTenTen corpus (a) are not balanced, (b) involve a certain amount of “noise”, and (c) do not provide metadata. As a consequence of this, it is argued, it may be wise to exercise some caution with regard to conclusions based on “big data”.

Identifier
DOI	https://doi.org/10.18710/DG75YC
Related Identifier	IsCitedBy https://doi.org/10.1515/slaw-2019-0012
Metadata Access	https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/DG75YC
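
The Metadata Access link above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from the DOI with the Python standard library. The function name `build_getrecord_url` is hypothetical and chosen here for illustration; the endpoint and parameter values are copied from the record above. The sketch only builds the URL and performs no network request.

```python
from urllib.parse import urlencode

# Endpoint taken from the Metadata Access field of this record.
OAI_ENDPOINT = "https://dataverse.no/oai"


def build_getrecord_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Assemble an OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL listed in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = build_getrecord_url("10.18710/DG75YC")
print(url)
```

The response to such a request is expected to be an XML document in the requested metadata format, which any XML parser can read.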

Provenance
Creator	Nesset, Tore
Publisher	DataverseNO
Contributor	Nesset, Tore; UiT The Arctic University of Norway; Berdicevskis, Aleksandrs; The Tromsø Repository of Language and Linguistics (TROLLing)
Publication Year	2018
Rights	CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess	true
Contact	Nesset, Tore (UiT The Arctic University of Norway)

Representation
Resource Type	Corpus data; Dataset
Format	text/plain; text/csv
Size	12769 bytes; 42585725 bytes
Version	1.2
Discipline	Humanities; Linguistics