Background data for: Advancing our understanding of dispersion measures in corpus research

Dataset description: This dataset contains background data and supplementary material for Sönning (forthcoming), a study of how dispersion measures behave when applied to text-level frequency data. For the literature survey reported in that study, which examines how dispersion measures are used in corpus-based work, the dataset includes tabular files listing the 730 research articles that were examined, together with annotations for those studies that measured dispersion in the corpus-linguistic (and lexicographic) sense. For the corpus data used to estimate the statistical model parameters underlying the simulation study reported in that paper, the dataset contains a term-document matrix for the 49,604 unique word forms (after conversion to lower case) that occur in the Brown Corpus. R scripts are also included that document in detail how the Brown Corpus XML files, which are available from the Natural Language Toolkit (Bird et al. 2009; https://www.nltk.org/), were processed to produce this data arrangement.
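To illustrate what a term-document matrix with lower-cased word forms looks like, here is a minimal Python sketch. It is not the R processing pipeline shipped with the dataset: the function name `term_document_matrix` and the toy texts are hypothetical, and tokenization is assumed to have been done already.

```python
from collections import Counter


def term_document_matrix(texts):
    """Build a term-document matrix from pre-tokenized texts.

    texts: dict mapping a text id to a list of tokens (hypothetical input format).
    Returns (vocab, doc_ids, matrix); matrix[i][j] is the count of vocab[i]
    in the text doc_ids[j].
    """
    # Count word forms per text after conversion to lower case.
    counts = {
        doc_id: Counter(token.lower() for token in tokens)
        for doc_id, tokens in texts.items()
    }
    # The vocabulary is the sorted set of all word forms seen in any text.
    vocab = sorted(set().union(*counts.values()))
    doc_ids = list(texts)
    # Counter returns 0 for word forms that do not occur in a text.
    matrix = [[counts[doc_id][term] for doc_id in doc_ids] for term in vocab]
    return vocab, doc_ids, matrix


# Toy data, invented for illustration only (not taken from the Brown Corpus).
toy_texts = {
    "text_a": ["The", "jury", "said", "the", "vote"],
    "text_b": ["A", "vote", "was", "held"],
}
vocab, doc_ids, matrix = term_document_matrix(toy_texts)
for term, row in zip(vocab, matrix):
    print(term, row)
```

Each row holds the sub-frequencies of one word form across texts, which is the kind of text-level frequency data that dispersion measures are computed from.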

Abstract (related publication): This paper offers a survey of recent corpus-based work, which shows that dispersion is typically measured across the text files in a corpus. Systematic insights into the behavior of measures in such distributional settings are currently lacking, however. After a thorough discussion of six prominent indices, we investigate their behavior on relevant frequency distributions, which are designed to mimic actual corpus data. Our evaluation considers different distributional settings, i.e. various combinations of frequency and dispersion values. The primary focus is on the response of measures to relatively high and low sub-frequencies, i.e. texts in which the item or structure of interest is over- or underrepresented (if not absent). We develop a simple method for constructing sensitivity profiles, which allow us to draw instructive comparisons among measures. We observe that these profiles vary considerably across distributional settings. While D and DP appear to show the most balanced response contours, our findings suggest that much work remains to be done to understand the performance of measures on items with normalized frequencies below 100 per million words.
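As a rough illustration of the two measures named at the end of the abstract, the following Python sketch computes Juilland's D and Gries's DP from text-level counts. It assumes the commonly cited definitions (D as one minus the coefficient of variation of the per-text rates divided by the square root of n - 1, and DP as half the summed absolute differences between observed and expected proportions). The paper may use different variants or scaling, and the function names are hypothetical.

```python
import math


def dispersion_dp(counts, sizes):
    """Gries's DP under the common definition (assumed here).

    counts: occurrences of the item in each text.
    sizes: number of words in each text.
    0 means the item is spread in proportion to text size; values near 1
    mean it is concentrated in a small part of the corpus.
    """
    total_count = sum(counts)
    total_size = sum(sizes)
    return 0.5 * sum(
        abs(v / total_count - w / total_size) for v, w in zip(counts, sizes)
    )


def dispersion_juilland_d(counts, sizes):
    """Juilland's D based on per-text rates (assumed variant).

    Uses the population standard deviation of the rates.
    1 means perfectly even; 0 means all occurrences fall in one text.
    """
    n = len(counts)
    rates = [v / w for v, w in zip(counts, sizes)]
    mean = sum(rates) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)


# Toy sub-frequencies, invented for illustration: four texts of 100 words each.
sizes = [100, 100, 100, 100]
even = [5, 5, 5, 5]      # item occurs equally often in every text
clumped = [20, 0, 0, 0]  # item occurs in a single text only

print(dispersion_dp(even, sizes), dispersion_juilland_d(even, sizes))
print(dispersion_dp(clumped, sizes), dispersion_juilland_d(clumped, sizes))
```

Note that the two measures run in opposite directions under these definitions: higher D indicates a more even distribution, whereas higher DP indicates a more uneven one. Both functions assume the item occurs at least once in the corpus and that there are at least two texts.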

Software: MAXQDA Plus, 22.5.0; R, 4.2.1

Identifier
DOI	https://doi.org/10.18710/FVHTFM
Related Identifier	IsCitedBy https://doi.org/10.3366/cor.2025.0326
Metadata Access	https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/FVHTFM

Provenance
Creator	Sönning, Lukas (ORCID: 0000-0002-2705-395X)
Publisher	DataverseNO
Contributor	Sönning, Lukas; University of Bamberg; The Tromsø Repository of Language and Linguistics (TROLLing)
Publication Year	2024
Rights	info:eu-repo/semantics/openAccess
OpenAccess	true
Contact	Sönning, Lukas (University of Bamberg)

Representation
Resource Type	textual linguistic data; Dataset
Format	text/plain; text/tsv; application/octet-stream
Size	15220; 48718; 4972; 50076558; 50076560; 6290
Version	1.1
Discipline	Design; Fine Arts, Music, Theatre and Media Studies; Humanities; Linguistics
Spatial Coverage	Bamberg, Germany