MCSD 1.0 - Multimodal Chinese Sarcasm Dataset

This repository contains the full text file of the Multimodal Chinese Sarcasm Dataset (MCSD), a curated dataset for research on multimodal sarcasm detection in publicly broadcast Mandarin Chinese stand-up comedy. The corpus is structured as follows:

Unique utterance ID for each transcribed segment.
Manually verified transcription of the spoken utterance (in Mandarin).
Pseudonymized speaker ID.
Annotated label (sarcastic / not sarcastic) for each transcription.
Aligned start and end timestamps.
Reference to the original publicly available video.
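The fields listed above could be represented in code roughly as follows. This is a minimal sketch, not the dataset's actual schema: the field names (`utterance_id`, `text`, `speaker_id`, `label`, `start`, `end`, `video_ref`), the label spellings, and the assumption that timestamps are numeric seconds are all hypothetical and should be checked against the column headers in the released file.

```python
from dataclasses import dataclass

# Hypothetical mapping from label strings to booleans; the released file
# may spell or encode its labels differently.
VALID_LABELS = {"sarcastic": True, "not sarcastic": False}


@dataclass(frozen=True)
class Utterance:
    """One transcribed segment. All field names are assumptions."""
    utterance_id: str
    text: str          # Mandarin transcription
    speaker_id: str    # pseudonymized
    sarcastic: bool
    start: float       # assumed to be seconds
    end: float         # assumed to be seconds
    video_ref: str

    @property
    def duration(self) -> float:
        """Length of the segment, in the same unit as the timestamps."""
        return self.end - self.start


def parse_row(row: dict[str, str]) -> Utterance:
    """Convert one row (a dict of strings) into an Utterance, with basic checks."""
    label = row["label"].strip().lower()
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {row['label']!r}")
    start, end = float(row["start"]), float(row["end"])
    if end < start:
        raise ValueError("end timestamp precedes start timestamp")
    return Utterance(
        utterance_id=row["utterance_id"],
        text=row["text"],
        speaker_id=row["speaker_id"],
        sarcastic=VALID_LABELS[label],
        start=start,
        end=end,
        video_ref=row["video_ref"],
    )
```

Validating the label and the timestamp order at load time means that a mismatch between these assumed conventions and the real file surfaces as an error instead of silently producing wrong labels or negative durations.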

For the full dataset description and annotation guidelines, please see: Link

Contributors and roles

Xiyuan Gao (University of Groningen) – PhD researcher. Responsible for dataset design, transcription processing, and annotation guidelines.
Dr. Bruce Xiao Wang (Hong Kong Polytechnic University) – Collaborator and linguistic expert. Contributed to the research framework, research methodology design, and Mandarin discourse insights.
Meiling Zhang, Shuming Zhang, and Zhu Li – Annotators. Manually labeled sarcasm in the transcribed data, following the annotation protocols developed for the project.
Dr. Matt Coler & Dr. Shekhar Nayak (University of Groningen) – Supervisors. Provided research supervision and guidance on ethical compliance.

YouTube Data API

Dataset highlights:

Multimodality: This is a novel multimodal sarcasm dataset in Mandarin Chinese, offering resources for cross-lingual sarcasm detection research.
Linguistically grounded annotation: The annotation protocol is informed by discourse theory and sarcasm typology, balancing sarcasm’s inherent ambiguity with annotator variability in interpretation.
Reproducibility-focused design: The dataset was built using a standardized pipeline for data collection, processing, and annotation, enabling reliable replication. To use the pipeline, please refer to the GitHub repo.

Identifier
DOI https://doi.org/10.34894/A0NLTQ
Metadata Access https://dataverse.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34894/A0NLTQ
Provenance
Creator Gao, Xiyuan; Bruce Xiao Wang; Meiling Zhang; Shuming Huang; Zhu Li; Shekhar Nayak; Matt Coler
Publisher DataverseNL
Contributor Groningen Digital Competence Centre; DataverseNL Network
Publication Year 2025
Rights CC-BY-NC-4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc/4.0
OpenAccess true
Contact Groningen Digital Competence Centre (rug.nl)
Representation
Resource Type Dataset
Format application/zip; application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Size 16927332; 375499; 6654
Version 1.0
Discipline Humanities; Linguistics
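
The Metadata Access entry above is an OAI-PMH `GetRecord` request. As a minimal sketch, the same URL can be assembled from the DOI using only the Python standard library. The function and constant names here are illustrative and not part of any DataverseNL client; the endpoint, verb, metadata prefix, and DOI are taken from the record above.

```python
from urllib.parse import urlencode

# Values copied from the Identifier section of this record.
OAI_ENDPOINT = "https://dataverse.nl/oai"
DOI = "10.34894/A0NLTQ"


def doi_url(doi: str) -> str:
    """Resolver URL for a DOI."""
    return f"https://doi.org/{doi}"


def oai_record_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Build an OAI-PMH GetRecord URL for the given DOI.

    ':' and '/' are left unescaped so the result matches the form
    shown in the Metadata Access field.
    """
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


if __name__ == "__main__":
    print(doi_url(DOI))
    print(oai_record_url(DOI))
    # The XML record could then be fetched with, for example,
    # urllib.request.urlopen(oai_record_url(DOI)); this needs network access
    # and is left out of the sketch.
```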