Content-based annotation of page images from the (archaeological) historical archive

This dataset uses an 11-label classification scheme to categorize scanned images of document pages by content and presentation format. The scheme distinguishes visual content (drawings, maps, paintings, schematics, and photographs), textual content (handwritten, printed, or machine-typed), and hybrid formats that combine multiple elements. Layout is also taken into account: content presented in tabular or form-like structures receives a separate label from content in paragraph or block format. For instance, standard drawings (DRAW📈) are distinguished from drawings with table-based legends (DRAW_L📈📏), and regular photographs (PHOTO🌄) from photographs embedded within tabular layouts (PHOTO_L🌄📏).

The textual categories distinguish three production methods, handwritten (✏️), printed (📄), and machine-typed (📄), and subdivide each by structural organization. Text can appear either in tabular or form-like arrangements (LINE_HW, LINE_P, LINE_T) or in traditional paragraph or block formats (TEXT_HW, TEXT_P, TEXT_T). An additional TEXT📰 category covers mixed documents that combine multiple text types or include minor graphical elements, which accommodates complex real-world pages.
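The 11 labels described above can be summarized as a small lookup table. The following is only an illustrative sketch: the label names come from the description, while the `(content, layout)` values and the helper `labels_with_layout` are hypothetical names chosen here and are not part of the dataset.

```python
# Label names are taken from the description above. The (content, layout)
# values are an illustrative reading of it, not an official schema.
LABELS = {
    "DRAW":    ("drawing", "plain"),        # standard drawings
    "DRAW_L":  ("drawing", "tabular"),      # drawings with table-based legends
    "PHOTO":   ("photo", "plain"),          # regular photographs
    "PHOTO_L": ("photo", "tabular"),        # photographs in tabular layouts
    "LINE_HW": ("handwritten", "tabular"),  # tabular/form-like text
    "LINE_P":  ("printed", "tabular"),
    "LINE_T":  ("typed", "tabular"),
    "TEXT_HW": ("handwritten", "block"),    # paragraph/block text
    "TEXT_P":  ("printed", "block"),
    "TEXT_T":  ("typed", "block"),
    "TEXT":    ("mixed", "mixed"),          # mixed text types, minor graphics
}


def labels_with_layout(layout):
    """Return the sorted label names whose layout matches (hypothetical helper)."""
    return sorted(name for name, (_, lay) in LABELS.items() if lay == layout)


tabular_labels = labels_with_layout("tabular")
```

Grouping labels this way can be convenient when inspecting errors per layout type, for example to check whether a model confuses tabular and block variants of the same text type.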

The dataset is organized for 5-fold cross-validation, with each fold split 80-10-10 into training, development, and test sets. The fold assignments are documented in an accompanying CSV file. This supports robust model evaluation and also ensembling: models trained on different folds can be averaged into a single combined model, provided they share the same base architecture.
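One way such a partitioning could be produced is sketched below. This is an assumption for illustration only: the dataset's actual fold assignments are the ones recorded in its CSV file and may have been generated differently. The function name `make_folds` and the chunk-rotation strategy are hypothetical.

```python
import random


def make_folds(items, n_folds=5, seed=0):
    """Build n_folds train/dev/test partitions (hypothetical sketch).

    Items are shuffled and cut into 2 * n_folds chunks. With n_folds=5
    that gives 10 chunks of about 10% each; fold k uses one chunk as
    test, one as dev, and the remaining eight (about 80%) as train.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_chunks = 2 * n_folds
    chunks = [items[i::n_chunks] for i in range(n_chunks)]
    folds = []
    for k in range(n_folds):
        held_out = (2 * k, 2 * k + 1)
        train = [x for j, chunk in enumerate(chunks)
                 if j not in held_out for x in chunk]
        folds.append({"train": train,
                      "dev": chunks[2 * k + 1],
                      "test": chunks[2 * k]})
    return folds


# Usage: 1000 placeholder page IDs split into 5 folds.
folds = make_folds(range(1000))
```

In this sketch every item lands in the test set of exactly one fold, so test results from the five folds together cover half of the data without overlap. Whether the published folds have this property should be checked against the CSV.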

Identifier
PID http://hdl.handle.net/20.500.12800/1-5959
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:20.500.12800/1-5959
Provenance
Creator Lutsai,Kateryna; Křivánková,Dana
Publisher Charles University in Prague, UFAL
Publication Year 2025
Funding Reference info:eu-repo/grantAgreement/EC/HORIZON-RIA/101132163
Rights Public Domain Mark (PD); http://creativecommons.org/publicdomain/mark/1.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Resource Type IMAGE
Format text/csv; image/png; application/zip; application/octet-stream; text/plain; charset=utf-8; downloadable_files_count: 18
Discipline Linguistics