Content-based annotation of page images from the (archaeological) historical archive

Dataset

This dataset uses an 11-label classification scheme to categorize scanned images of document pages by their content and presentation format. The scheme distinguishes between visual content (drawings, maps, paintings, schematics, and photographs), textual content (handwritten, printed, or machine-typed), and hybrid formats that combine multiple elements. Layout is labeled separately: content presented in tabular or form-like structures receives a different label from content in paragraph or block formats. For instance, standard drawings (DRAW📈) are distinguished from drawings with table-based legends (DRAW_L📈📏), and regular photographs (PHOTO🌄) from those embedded within tabular layouts (PHOTO_L🌄📏).

The textual categories distinguish three input methods, handwritten (✏️), printed (📄), and machine-typed (📄), and subdivide each by structural organization. Text can appear either in tabular or form-like arrangements (LINE_HW, LINE_P, LINE_T) or in paragraph or block formats (TEXT_HW, TEXT_P, TEXT_T). An additional TEXT📰 category covers mixed documents that combine multiple text types or include minor graphical elements.
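
The label scheme described above can be written down as a small lookup table. The following is only a sketch: the label names come from the description, while the `PageLabel` class, the `content` and `layout` field values, and the helper function are hypothetical names introduced here for illustration and are not part of the dataset itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLabel:
    name: str     # label name as given in the dataset description
    content: str  # assumed grouping: drawing, photo, handwritten, printed, typed, mixed
    layout: str   # assumed grouping: "tabular" (table/form-like) or "block"


# The 11 labels named in the description; the content/layout values are
# this sketch's own reading of that description, not official metadata.
LABELS = [
    PageLabel("DRAW", "drawing", "block"),
    PageLabel("DRAW_L", "drawing", "tabular"),
    PageLabel("PHOTO", "photo", "block"),
    PageLabel("PHOTO_L", "photo", "tabular"),
    PageLabel("LINE_HW", "handwritten", "tabular"),
    PageLabel("LINE_P", "printed", "tabular"),
    PageLabel("LINE_T", "typed", "tabular"),
    PageLabel("TEXT_HW", "handwritten", "block"),
    PageLabel("TEXT_P", "printed", "block"),
    PageLabel("TEXT_T", "typed", "block"),
    PageLabel("TEXT", "mixed", "block"),
]


def names_with_layout(layout):
    """Return the label names whose assumed layout matches `layout`."""
    return [label.name for label in LABELS if label.layout == layout]
```

Calling `names_with_layout("tabular")` lists the five labels that this sketch treats as table- or form-like, which mirrors the layout distinction the scheme draws.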

The dataset is organized for 5-fold cross-validation, with each fold split 80-10-10 into training, development, and test sets respectively. The fold assignments are documented in an accompanying CSV file. This supports evaluation across folds as well as ensembling: models trained on different folds can be averaged into a single combined model, provided they share the same base architecture.
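
One possible reading of "averaging" models from different folds is an elementwise mean of their parameters, which is only well defined when every model has the same parameter names and shapes. The sketch below illustrates that idea with plain Python dictionaries; the function name `average_weights` and the dictionary-of-lists representation are assumptions made for this example and do not describe the authors' actual tooling.

```python
def average_weights(models):
    """Average parameters elementwise across models.

    `models` is a list of dicts mapping a parameter name to a flat list
    of floats (a hypothetical stand-in for real model weights). All
    models must share the same names and lengths, i.e. the same base
    architecture.
    """
    if not models:
        raise ValueError("need at least one model")

    reference = models[0]
    for model in models[1:]:
        if model.keys() != reference.keys():
            raise ValueError("models do not share the same parameter names")
        for name in reference:
            if len(model[name]) != len(reference[name]):
                raise ValueError(f"parameter {name!r} differs in size")

    count = len(models)
    return {
        name: [sum(values) / count for values in zip(*(m[name] for m in models))]
        for name in reference
    }


# Two toy "fold models" with identical layouts.
fold_a = {"w": [1.0, 3.0], "b": [0.0]}
fold_b = {"w": [3.0, 5.0], "b": [2.0]}
combined = average_weights([fold_a, fold_b])
```

The explicit shape check is the code-level counterpart of the requirement that the averaged models share one base architecture; averaging predictions instead of weights would be an alternative that does not need this constraint.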

Identifier
PID	http://hdl.handle.net/20.500.12800/1-5959
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:20.500.12800/1-5959

Provenance
Creator	Lutsai,Kateryna; Křivánková,Dana
Publisher	Charles University in Prague, UFAL
Publication Year	2025
Funding Reference	info:eu-repo/grantAgreement/EC/HORIZON-RIA/101132163
Rights	Public Domain Mark (PD); http://creativecommons.org/publicdomain/mark/1.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Resource Type	IMAGE
Format	text/csv; image/png; application/zip; application/octet-stream; text/plain; charset=utf-8; downloadable_files_count: 18
Discipline	Linguistics