Historical Postcards Dataset

This deposit contains the Historical Postcards Dataset (COCO), v1.0 (2025), a Common Objects in Context (COCO) format dataset of historical postcard images and structured annotations intended for the detection of text and postal markings. Printed text detections include transcriptions (manual and OCR), text orientation, and OCR confidence scores, which makes them suitable for detection and historical OCR benchmarking. Transcriptions of postal markings, handwritten text, and scene text will be added in future versions. Because the annotations follow the COCO format, they can be read with the COCO API (pycocotools in Python).
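As a minimal sketch of reading COCO-style annotations with only the Python standard library, the snippet below groups bounding boxes by image. The keys `images`, `annotations`, `categories`, `image_id`, `category_id`, and `bbox` come from the general COCO format. The file name, category names, and values in the sample are invented for illustration, and the dataset-specific fields (transcription, orientation, OCR confidence) are deliberately left out because their exact names are documented in README.md, not here.

```python
import json
from collections import defaultdict


def group_annotations_by_image(coco):
    """Return {file_name: [(category_name, bbox), ...]} for a COCO dict."""
    categories = {cat["id"]: cat["name"] for cat in coco["categories"]}
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        # COCO boxes are [x, y, width, height] in pixels.
        grouped[file_names[ann["image_id"]]].append(
            (categories[ann["category_id"]], ann["bbox"])
        )
    return dict(grouped)


# Hypothetical stand-in for a real annotation file. With the real data,
# load the JSON file from the annotations archive with json.load instead.
SAMPLE_JSON = """
{
  "images": [{"id": 1, "file_name": "card_0001.jpg", "width": 800, "height": 500}],
  "categories": [{"id": 1, "name": "printed_text"}, {"id": 2, "name": "postmark"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [40, 30, 200, 25]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [600, 50, 90, 90]}
  ]
}
"""
coco = json.loads(SAMPLE_JSON)
grouped = group_annotations_by_image(coco)
for file_name, items in grouped.items():
    print(file_name, items)
```

The same grouping is available through pycocotools, which also provides evaluation utilities; the plain-dictionary version is shown only because it has no dependencies.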

Contents & purpose

The dataset comprises 4,293 postcard images (Image'Est archive, Grand Est region, France; period ca. 1899–1930) with COCO annotations for the detection of text and postal markings. It is intended for training and evaluating text detection, OCR pipelines, and postal marking recognition for cultural-heritage research.

This dataset is presented at the 7th workshop on analySis, Understanding and proMotion of heritAge Contents (7th SUMAC @ ACM Multimedia 2025), on 27 October 2025 in Dublin, Ireland. The related paper describes the dataset and methodological details.

Provided archives

Historical_Postcards_Dataset_v1-Train2025.zip — training set used in 5-fold cross-validation
annotations-Historical_Postcards_Dataset_v1-Train2025.zip — annotations only
Historical_Postcards_Dataset_v1-Test2025.zip — test set used in the conference article
annotations-Historical_Postcards_Dataset_v1-Test2025.zip — annotations only
Historical_Postcards_Dataset_v1-Synth2025.zip — set with synthetic annotations
annotations-Historical_Postcards_Dataset_v1-Synth2025.zip — annotations only

This subdivision facilitates import into CVAT. More information is available in the README.md file.
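The training archive is described as the set used in 5-fold cross-validation. The deposit page does not say how the folds were assigned, so the following is only a generic sketch of a seeded 5-fold split over image ids. The function name `k_fold_splits`, the seed, and the toy id range are assumptions, not the authors' procedure.

```python
import random


def k_fold_splits(image_ids, k=5, seed=0):
    """Return k (train_ids, val_ids) pairs from a seeded shuffle."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled ids into k folds of near-equal size.
    folds = [ids[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val_ids = folds[i]
        train_ids = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train_ids, val_ids))
    return splits


# Toy example with 20 hypothetical image ids.
splits = k_fold_splits(range(1, 21), k=5, seed=0)
for fold_index, (train_ids, val_ids) in enumerate(splits):
    print(fold_index, len(train_ids), len(val_ids))
```

Each image id appears in exactly one validation fold, and fixing the seed keeps the assignment reproducible across runs.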

File structure & annotation schema

See the README.md file for more details.

Acknowledgments

We would like to thank Image'Est for making the historical postcard data available, and the Grand Est Region, France, which supported this work.

Python, 3.12.8

numpy, 2.1.3

pandas, 2.2.3

pillow, 11.2.1

torch, 2.7.0

ultralytics, 8.3.140

pytesseract, 0.3.13

easyocr, 1.7.2

pycocotools, 2.0.10
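To compare a local environment against the package versions listed above, a small check with `importlib.metadata` can help. This is a sketch only: it assumes the listed versions describe the environment used with the dataset, and the names `LISTED` and `compare_versions` are introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above (Python itself is not a distribution).
LISTED = {
    "numpy": "2.1.3",
    "pandas": "2.2.3",
    "pillow": "11.2.1",
    "torch": "2.7.0",
    "ultralytics": "8.3.140",
    "pytesseract": "0.3.13",
    "easyocr": "1.7.2",
    "pycocotools": "2.0.10",
}


def compare_versions(listed, lookup=version):
    """Return {name: (listed_version, installed_version_or_None, matches)}."""
    report = {}
    for name, wanted in listed.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None  # package is not installed locally
        report[name] = (wanted, found, found == wanted)
    return report


report = compare_versions(LISTED)
for name, (wanted, found, matches) in report.items():
    print(f"{name}: listed {wanted}, installed {found}, match={matches}")
```

A mismatch does not necessarily mean the data cannot be used; it only flags where results might differ from those obtained with the listed versions.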

Identifier
DOI https://doi.org/10.57745/GELGHH
Related Identifier IsCitedBy https://doi.org/10.1145/3746273.3760201
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/GELGHH
Provenance
Creator Pelingre, Matthieu (ORCID: 0000-0002-9754-245X); Tabbone, Salvatore Antoine
Publisher Recherche Data Gouv
Contributor Pelingre, Matthieu; Tabbone, Salvatore Antoine; Image'Est; Conseil régional du Grand Est; Institut des sciences du Digital, Management & Cognition; Université de Lorraine; Entrepôt Recherche Data Gouv
Publication Year 2025
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact Pelingre, Matthieu (Université de Lorraine; France); Tabbone, Salvatore Antoine (LORIA; Université de Lorraine; CNRS; INRIA; France)
Representation
Resource Type Dataset
Format application/zip; text/markdown
Size 648449; 32750; 279571; 9773078251; 554418692; 5213763327; 5594
Version 1.1
Discipline Humanities; Computer Science