The MATCHED dataset is a novel multimodal collection of escort advertisements curated to support research in Authorship Attribution (AA) and related tasks. It comprises 27,619 unique text descriptions and 55,115 images (in JPG format) sourced from Backpage escort ads across seven major U.S. cities: Atlanta, Dallas, Detroit, Houston, Chicago, San Francisco, and New York. These cities are grouped into four geographical regions (South, Midwest, West, and Northeast), a structure that enables both in-distribution and out-of-distribution (OOD) evaluations. Each ad contains metadata that links its text and visual components, providing a rich resource for studying multimodal patterns, vendor identification, and verification tasks.
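The regional grouping can be used to build an OOD split by holding out one region at evaluation time. The following is a minimal sketch, not the dataset's official tooling: the city-to-region assignment assumes the standard U.S. Census regions (the description above does not spell out the mapping), and the record fields `ad_id` and `city` as well as the function `split_by_region` are hypothetical names chosen for illustration.

```python
# Assumed city-to-region assignment (standard U.S. Census regions);
# the dataset's own mapping may differ.
CITY_TO_REGION = {
    "Atlanta": "South",
    "Dallas": "South",
    "Houston": "South",
    "Detroit": "Midwest",
    "Chicago": "Midwest",
    "San Francisco": "West",
    "New York": "Northeast",
}


def split_by_region(ads, held_out_region):
    """Split ads into (in_distribution, out_of_distribution) lists.

    Ads whose city falls in `held_out_region` go to the OOD list;
    all other ads stay in the in-distribution list.
    """
    in_dist, ood = [], []
    for ad in ads:
        region = CITY_TO_REGION[ad["city"]]
        if region == held_out_region:
            ood.append(ad)
        else:
            in_dist.append(ad)
    return in_dist, ood


# Hypothetical records; the real metadata schema may use other field names.
ads = [
    {"ad_id": "a1", "city": "Atlanta"},
    {"ad_id": "a2", "city": "Chicago"},
    {"ad_id": "a3", "city": "San Francisco"},
    {"ad_id": "a4", "city": "Detroit"},
]

in_dist_ads, ood_ads = split_by_region(ads, held_out_region="Midwest")
```

Holding out a whole region, rather than a random sample of ads, keeps every ad from the held-out cities out of the in-distribution set, which is what makes the evaluation out-of-distribution with respect to geography.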
The dataset is well suited to multimodal authorship attribution, vendor linking, stylometric analysis, and the study of how textual and visual patterns interact in advertisements. All text descriptions have been processed to redact explicit references to phone numbers, email addresses, advertisement IDs, age-related information, and any other contact details that could be used to identify individuals or vendors. The structured metadata allows researchers to explore how multimodal features help uncover latent patterns in stylometry and vendor behavior. A demo data file showcasing the format and structure of the MATCHED dataset is attached to this entry.
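To give a sense of what this kind of redaction involves, the sketch below masks email addresses and U.S.-style phone numbers with regular expressions. It is purely illustrative and is not the procedure used to build MATCHED: the patterns, the placeholder tokens `[EMAIL]` and `[PHONE]`, and the function name `redact` are all assumptions, and a real pipeline would also need to cover advertisement IDs, age-related information, and obfuscated contact details.

```python
import re

# Illustrative patterns only; real ads often obfuscate contact details
# (spelled-out digits, inserted symbols) that these will not catch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def redact(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Call 404-555-0123 or mail me at jane@example.com"
redacted = redact(sample)
```

Emails are masked before phone numbers so that any digit runs inside an address are removed together with the address instead of being matched separately.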
Given the sensitivity of the subject matter, the actual dataset resides securely on Maastricht University's servers. Only the metadata will be publicly released on Dataverse to ensure ethical use. Researchers interested in accessing the full dataset must sign a Non-Disclosure Agreement (NDA) and a Data Transfer Agreement with Prof. Dr. Gijs Van Dijck from Maastricht University. Access will only be granted under strict restrictions, and recipients must adhere to the ethical guidelines established by the university's committee. These guidelines emphasize the responsible use of the dataset to prevent misuse and to safeguard the privacy and dignity of all individuals involved.