MCSD 1.0 - Multimodal Chinese Sarcasm Dataset

This repository contains the full text file of the Multimodal Chinese Sarcasm Dataset (MCSD), a curated dataset for research on multimodal sarcasm detection in publicly broadcast Mandarin Chinese stand-up comedy. The corpus is structured as follows:

A unique utterance ID for each transcribed segment.
A manually verified transcription of the spoken utterance (in Mandarin).
A pseudonymized speaker ID.
An annotated label (sarcastic / not sarcastic) for each transcription.
Aligned start and end timestamps.
A reference to the original publicly available video.
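The fields listed above could be modelled roughly as follows. This is only a minimal sketch: the field names (`utterance_id`, `start_s`, and so on), the boolean encoding of the label, and the assumption that timestamps are expressed in seconds are all hypothetical and may not match the column names or formats used in the released files.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Utterance:
    """One transcribed segment. All field names here are hypothetical."""
    utterance_id: str   # unique ID of the transcribed segment
    text: str           # manually verified Mandarin transcription
    speaker_id: str     # pseudonymized speaker ID
    sarcastic: bool     # True = sarcastic, False = not sarcastic
    start_s: float      # start timestamp (assumed to be in seconds)
    end_s: float        # end timestamp (assumed to be in seconds)
    video_ref: str      # reference to the original public video

    @property
    def duration_s(self) -> float:
        """Length of the aligned segment in seconds."""
        return self.end_s - self.start_s


def sarcasm_rate(utterances: Sequence[Utterance]) -> float:
    """Fraction of utterances labelled sarcastic (0.0 for an empty input)."""
    if not utterances:
        return 0.0
    return sum(u.sarcastic for u in utterances) / len(utterances)
```

A typed record like this makes it easy to check label balance or segment length before training, but the mapping from the actual spreadsheet columns to these fields would need to be confirmed against the dataset description.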

For full dataset description and annotation guidelines, please see: Link

Contributors and roles

Xiyuan Gao (University of Groningen): PhD researcher. Responsible for dataset design, transcription processing, and annotation guidelines.
Dr. Bruce Xiao Wang (Hong Kong Polytechnic University): Collaborator and linguistic expert. Contributed to the research framework, research methodology design, and Mandarin discourse insights.
Meiling Zhang, Shuming Zhang, and Zhu Li: Carried out manual labeling of sarcasm in the transcribed data, following the annotation protocols developed for the project.
Dr. Matt Coler and Dr. Shekhar Nayak (University of Groningen): Supervisors. Provided research supervision and guidance on ethical compliance.

YouTube Data API

Dataset highlights:

Multimodality: This is a novel multimodal sarcasm dataset in Mandarin Chinese, offering resources for cross-lingual sarcasm detection research.
Linguistically grounded annotation: The annotation protocol is informed by discourse theory and sarcasm typology, balancing sarcasm’s inherent ambiguity against annotator variability in interpretation.
Reproducibility-focused design: The dataset was built using a standardized pipeline for data collection, processing, and annotation, enabling reliable replication. To use the pipeline, please refer to the GitHub repo.

Identifier
DOI	https://doi.org/10.34894/A0NLTQ
Metadata Access	https://dataverse.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34894/A0NLTQ
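The Metadata Access URL above is an OAI-PMH `GetRecord` request for the `oai_datacite` metadata format. A minimal sketch of building that request and reading titles from the response might look like the following. The helper names (`build_getrecord_url`, `fetch_record`, `extract_titles`) are illustrative and not part of any DataverseNL client, and the parser simply assumes that the response contains DataCite-style `title` elements.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_ENDPOINT = "https://dataverse.nl/oai"   # taken from the record above
DATASET_ID = "doi:10.34894/A0NLTQ"          # taken from the record above


def build_getrecord_url(identifier: str = DATASET_ID,
                        metadata_prefix: str = "oai_datacite") -> str:
    """Build the OAI-PMH GetRecord URL for one dataset identifier."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown above.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


def fetch_record(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw XML response (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_titles(xml_bytes: bytes) -> list:
    """Return the text of every element whose local name is 'title'."""
    root = ET.fromstring(xml_bytes)
    titles = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop any XML namespace
        if local_name == "title":
            titles.append((element.text or "").strip())
    return titles
```

Calling `extract_titles(fetch_record(build_getrecord_url()))` would then be one way to read the dataset title from the live record; matching on local names avoids hard-coding a schema namespace, at the cost of being less strict.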

Provenance
Creator	Gao, Xiyuan; Bruce Xiao Wang; Meiling Zhang; Shuming Huang; Zhu Li; Shekhar Nayak; Matt Coler
Publisher	DataverseNL
Contributor	Groningen Digital Competence Centre; DataverseNL Network
Publication Year	2025
Rights	CC-BY-NC-4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc/4.0
OpenAccess	true
Contact	Groningen Digital Competence Centre (rug.nl)

Representation
Resource Type	Dataset
Format	application/zip; application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Size (bytes)	16927332; 375499; 6654
Version	1.0
Discipline	Humanities; Linguistics