This repository contains the full text file of the Multimodal Chinese Sarcasm Dataset (MCSD), a curated dataset for research on multimodal sarcasm detection in publicly broadcast Mandarin Chinese stand-up comedy. The corpus is structured as follows:
unique utterance ID for each transcribed segment.
manually verified transcription of the spoken utterance (in Mandarin).
pseudonymized speaker ID.
annotated label (sarcastic / not sarcastic) for each transcription.
aligned start and end timestamps.
reference to the original publicly available video.
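The fields above map naturally onto a small record type. The following is a minimal Python sketch of how such a file might be loaded. It assumes a tab-separated file with a header row; the column names (`utterance_id`, `text`, `speaker_id`, `label`, `start`, `end`, `video_ref`), the timestamp unit (seconds), and the sample rows are all hypothetical placeholders and should be adjusted to match the actual file in this repository.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Utterance:
    """One transcribed segment. Field names are assumptions, not the official schema."""
    utterance_id: str
    text: str          # Mandarin transcription
    speaker_id: str    # pseudonymized speaker ID
    sarcastic: bool    # True if labeled "sarcastic"
    start: float       # assumed to be seconds
    end: float         # assumed to be seconds
    video_ref: str     # reference to the original public video

    @property
    def duration(self) -> float:
        return self.end - self.start


def load_utterances(handle, delimiter="\t"):
    """Parse rows from an open text handle into Utterance records."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        Utterance(
            utterance_id=row["utterance_id"],
            text=row["text"],
            speaker_id=row["speaker_id"],
            sarcastic=row["label"].strip().lower() == "sarcastic",
            start=float(row["start"]),
            end=float(row["end"]),
            video_ref=row["video_ref"],
        )
        for row in reader
    ]


# Made-up placeholder rows, used only to exercise the loader.
SAMPLE = "\n".join([
    "\t".join(["utterance_id", "text", "speaker_id", "label", "start", "end", "video_ref"]),
    "\t".join(["u0001", "示例文本一", "spk01", "sarcastic", "0.0", "2.5", "video_placeholder"]),
    "\t".join(["u0002", "示例文本二", "spk01", "not sarcastic", "2.5", "6.0", "video_placeholder"]),
])

records = load_utterances(io.StringIO(SAMPLE))
sarcastic_count = sum(r.sarcastic for r in records)
print(f"{len(records)} utterances, {sarcastic_count} labeled sarcastic")
```

For the real file, replace `io.StringIO(SAMPLE)` with `open(path, encoding="utf-8", newline="")` so that the Mandarin text is decoded correctly.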
For full dataset description and annotation guidelines, please see: Link
Contributors and roles
Xiyuan Gao (University of Groningen) – PhD researcher. Responsible for dataset design, transcription processing, and annotation guidelines.
Dr. Bruce Xiao Wang (Hong Kong Polytechnic University) – Collaborator and linguistic expert. Contributed to the research framework and methodology design, and provided insights on Mandarin discourse.
Meiling Zhang, Shuming Zhang, and Zhu Li – Carried out manual labeling of sarcasm in the transcribed data, following the annotation protocol developed for the dataset.
Dr. Matt Coler & Dr. Shekhar Nayak (University of Groningen) – Supervisors. Provided research supervision and guidance on ethical compliance.
YouTube Data API
Dataset highlights:
Multimodality: This is a novel multimodal sarcasm dataset in Mandarin Chinese, offering resources for cross-lingual sarcasm detection research.
Linguistically grounded annotation: The annotation protocol is informed by discourse theory and sarcasm typology, balancing sarcasm’s inherent ambiguity with annotator variability in interpretation.
Reproducibility-focused design: The dataset was built using a standardized pipeline for data collection, processing, and annotation, enabling reliable replication. To use the pipeline, please refer to the GitHub repo.