Background Data for: What is a Chinese word? Lexical constructionalization in Chinese

The word is commonly assumed to be the basic linguistic unit, but its definition has long been controversial for Chinese. Chinese is written in Chinese characters, with no spaces between words: the inherent and (relatively) stable unit of the language is the character, not the word. The study examines the nature of Chinese words, i.e., how they are formed from Chinese characters.

Experiment Data come from a word segmentation test completed by native speakers of Chinese.

Correlation Analysis reports the correlation between the results of the word segmentation test and the conventionality index of Modern Mandarin.
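As an illustration of this kind of analysis, the following is a minimal sketch that correlates two numeric columns of a CSV table. It is not the author's procedure: the record does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the column names (`segmented_as_word`, `conventionality`) and the sample rows are hypothetical placeholders rather than values from the dataset.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Return Pearson's r for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlate_columns(csv_text, x_col, y_col):
    """Read two named columns from CSV text and correlate them."""
    reader = csv.DictReader(io.StringIO(csv_text))
    xs, ys = [], []
    for row in reader:
        xs.append(float(row[x_col]))
        ys.append(float(row[y_col]))
    return pearson(xs, ys)


# Made-up illustrative rows; NOT taken from the dataset.
# Column names are hypothetical and must be adapted to the real files.
SAMPLE = """item,segmented_as_word,conventionality
A,0.10,1.0
B,0.40,2.0
C,0.70,3.0
D,1.00,4.0
"""

r = correlate_columns(SAMPLE, "segmented_as_word", "conventionality")
print(f"Pearson r = {r:.3f}")
```

To apply the sketch to the deposited files, read the file contents into `csv_text` and substitute the actual column headers. If the measures are ordinal or not linearly related, a rank-based coefficient such as Spearman's rho may be more appropriate.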

Besides the conventionality index of Modern Mandarin, the conventionality of frequent two-character combinations in Zuozhuan, Shishuoxinyu, Bianwen, Nogeoldae & Bak Tongsa, and Sanyanerpai is also presented in separate files.

Article abstract: The Chinese language is defined on the basis of Chinese characters, which stabilize monosyllabic root morphemes across the countless varieties. As subsyllabic linguistic forms such as derivational morphology can hardly be represented by Chinese characters, compounding is preferred over derivation in Chinese. Compounds do not have fixed word boundaries. The wordhood of compounds pertains to the level of conventionality in language use, which is a continuum instantiated by synchronic gradience and diachronic gradualness. A perennial archaizing aesthetics further complicates the determination of Chinese words by preserving classical linguistic forms in formal and literary writing, thus making every synchronic stratum heterogeneous by blurring the distinction between historical strata. Therefore, the boundaries of words have always been fluid in native speakers’ mental lexicon.

AntConc, 3.5.9

Identifier
DOI https://doi.org/10.18710/HCA0AU
Related Identifier IsCitedBy https://doi.org/10.1515/flin-2024-2023
Metadata Access https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/HCA0AU
Provenance
Creator Zhang, Liulin
Publisher DataverseNO
Contributor Zhang, Liulin; Soochow University; Wenjuanxing; The Tromsø Repository of Language and Linguistics (TROLLing)
Publication Year 2025
Funding Reference National Social Science Fund of China (NSSFC) 20FYYB043
Rights CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess true
Contact Zhang, Liulin (Soochow University)
Representation
Resource Type Corpus data; Dataset
Format text/plain; text/csv
Size 10653; 13958; 10348; 73602; 73841; 78046; 74891; 73354; 68804
Version 1.0
Discipline Humanities
Spatial Coverage Mainland China