Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling from hierarchically structured corpus data

Dataset description
This dataset, adapted from Jenset and McGillivray (2017), contains tabular files documenting the alternation between -(e)th and -(e)s as markers of third-person singular verb inflection in Early Modern English. The data provided by Jenset and McGillivray (2017) are drawn from the PPCEME corpus (Kroch et al. 2004) and cover the period from 1500 to 1700. In total, these authors annotated 13,757 third-person singular tokens (excluding the verb BE) for a range of variables. For the purposes of the present methodological study, the dataset was reduced to a subset of 11,645 tokens, and the coding of some variables was revised, completed, or modified. The dataset records the Author and Verb Lemma of each token, together with a number of predictor variables: Genre, Year, Frequency (of the verb lemma in the third-person singular), Phonological Context (stem-final sound), and the Gender of the author.

Abstract for related publication
Resource constraints often force researchers to down-size the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: Year, Gender, Genre, Frequency, and Phonological Context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 sub-samples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
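
The two selection schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the code used in the study. The field names `author` and `verb`, the toy corpus, and the helper names are hypothetical. The weight 1 / (author token count × verb token count) is only one possible reading of "inversely proportional to the author- and verb-specific token count", and the weighted draw without replacement uses a standard exponential-key method chosen here for brevity.

```python
import math
import random
from collections import Counter


def structured_weights(tokens):
    """Weight each token by 1 / (author token count * verb token count).

    ASSUMPTION: this product form is one reading of the abstract's wording;
    it is not taken from the scripts that accompany the dataset.
    """
    author_n = Counter(t["author"] for t in tokens)
    verb_n = Counter(t["verb"] for t in tokens)
    return [1.0 / (author_n[t["author"]] * verb_n[t["verb"]]) for t in tokens]


def simple_downsample(tokens, k, rng):
    """Simple down-sampling: every token is equally likely to be selected."""
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, rng):
    """Weighted sampling without replacement via exponential keys."""
    weights = structured_weights(tokens)
    # Each token gets the key log(u) / w with u drawn from (0, 1].
    # Keeping the k largest keys favours tokens with larger weights,
    # i.e. tokens from sparsely attested authors and verbs.
    keyed = [(math.log(1.0 - rng.random()) / w, i) for i, w in enumerate(weights)]
    keyed.sort(reverse=True)
    return [tokens[i] for _, i in keyed[:k]]


# Hypothetical toy corpus: author A contributes three tokens, author B one.
tokens = [
    {"id": 0, "author": "A", "verb": "say"},
    {"id": 1, "author": "A", "verb": "say"},
    {"id": 2, "author": "A", "verb": "go"},
    {"id": 3, "author": "B", "verb": "go"},
]

rng = random.Random(1)  # fixed seed so the draws are reproducible
simple = simple_downsample(tokens, 2, rng)
structured = structured_downsample(tokens, 2, rng)
```

In the toy corpus the single token by author B receives a weight of 1/2, while each token by author A receives 1/6, so the structured scheme is more likely to retain the sparsely represented author. For the setting described in the abstract, `k` would be 2,000 and the draw would be repeated 500 times per scheme.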

R, 4.2.1

RStudio, 2023.06.2

Identifier
DOI https://doi.org/10.18710/5KCE4U
Metadata Access https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/5KCE4U
Provenance
Creator Sönning, Lukas (ORCID: 0000-0002-2705-395X)
Publisher DataverseNO
Contributor Sönning, Lukas; Alan Turing Institute, University of Cambridge; University of Bamberg; The Tromsø Repository of Language and Linguistics (TROLLing)
Publication Year 2023
Rights info:eu-repo/semantics/openAccess
OpenAccess true
Contact Sönning, Lukas (University of Bamberg)
Representation
Resource Type observational data; Dataset
Format text/plain; text/tsv; application/octet-stream
Size 12381; 2120816; 13462
Version 1.0
Discipline Humanities; Linguistics