NCSE v2.0: A Dataset of OCR-Processed 19th Century English Newspapers


NCSE v2.0 Dataset Repository

This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models".

Dataset Overview

NCSE v2.0 is a digitized collection of six 19th-century English periodicals containing:

- 82,690 pages
- 1.4 million entries
- 321 million words
- 1.9 billion characters

The dataset includes:

- 1.1 million text entries
- 198,000 titles
- 17,000 figure descriptions
- 16,000 tables

Repository Contents

NCSE v2.0 Dataset

- NCSE_v2.zip: a folder containing a parquet file for each of the periodicals, as well as a readme file.

Bounding Box Dataset

A zip file called bounding_box.zip. It contains:

- post_process: a folder of the processed periodical bounding box data
- post_process_fill: a folder of the processed periodical bounding box data WITH column filling
- bbox_readme.txt: a readme file and data description for the bounding boxes

Test Sets

- cropped_images.zip: 378 images cropped from the NCSE test set pages, all 2-bit PNG files
- ground_truth: 358 text files corresponding to the text from the cropped_images folder

Classification Training Data

The files below are used for training the classification models. They contain 12,000 observations, 2,000 from each periodical. The labels were classified using mistral-large-2411. This data is used to train the ModernBERT classifier described in the paper. The topics are taken from the International Press Telecommunications Council (IPTC) subject codes.

- silver_IPTC_class.parquet: IPTC topic classification silver set
- silver_text_type.parquet: text-type classification silver set

Classified Data

The zip file "classification_data.zip" contains all rows classified using the ModernBERT classifier described in the paper.

- IPTC_type_classified.zip: contains one parquet file per periodical
- text_type_classified.zip: contains one parquet file per periodical
- classification_readme.md: description of the data

Classification Mappings

Data for mapping the classification codes to human-readable names.

- class_mappings.zip: contains a JSON file for each classification type:
  - IPTC_class_mapping.json
  - text_type_class_mapping.json

Original Images

The original page images can be found at the King's College London Repositories:

- Monthly Repository
- Northern Star
- Leader
- English Woman's Journal
- Tomahawk
- Publishers' Circular

Or via the project central archive.

Citation

If you use this dataset, please cite:

No citation data currently available

Related Code

All original code related to this project, including the creation of the datasets and their analysis, can be found at: https://github.com/JonnoB/ereading_the_unreadable

Contact

For questions about the dataset, please create an issue in this repository.

Usage Rights

In keeping with the original NCSE dataset, all data is made available under a Creative Commons Attribution 4.0 International License (CC BY).
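
Example: reading the parquet archives

Several of the archives described in this repository (for example NCSE_v2.zip) hold one parquet file per periodical. The following is a minimal sketch, not the project's own code, of how such an archive might be inspected and loaded in Python. The function names list_parquet_members and load_periodical are hypothetical, the internal file names of the archives are not documented here and should be checked against the included readme, and loading assumes that pandas with a parquet engine such as pyarrow is installed.

```python
import io
import zipfile


def list_parquet_members(archive):
    """Return the sorted names of all parquet files inside a zip archive.

    `archive` may be a path (e.g. "NCSE_v2.zip") or a file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name for name in zf.namelist() if name.lower().endswith(".parquet")
        )


def load_periodical(archive, member):
    """Load one parquet member of the archive into a pandas DataFrame.

    pandas is imported lazily so that listing the archive works without it.
    Assumption: a parquet engine (pyarrow or fastparquet) is available.
    """
    import pandas as pd

    with zipfile.ZipFile(archive) as zf:
        # Read the member into memory and hand pandas a file-like object.
        return pd.read_parquet(io.BytesIO(zf.read(member)))
```

A typical use would be to call list_parquet_members("NCSE_v2.zip") to see which periodicals are present, then pass one of the returned names to load_periodical. The column names of the resulting table are described in the readme inside the archive rather than assumed here.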

Identifier
DOI https://doi.org/10.5522/04/28381610.v1
Related Identifier HasPart https://ndownloader.figshare.com/files/52256369
Related Identifier HasPart https://ndownloader.figshare.com/files/52256378
Related Identifier HasPart https://ndownloader.figshare.com/files/52260608
Related Identifier HasPart https://ndownloader.figshare.com/files/52260611
Related Identifier HasPart https://ndownloader.figshare.com/files/52260746
Related Identifier HasPart https://ndownloader.figshare.com/files/52260821
Related Identifier HasPart https://ndownloader.figshare.com/files/52260923
Related Identifier HasPart https://ndownloader.figshare.com/files/52260941
Metadata Access https://api.figshare.com/v2/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:figshare.com:article/28381610
Provenance
Creator Bourne, Jonno
Publisher University College London UCL
Contributor Figshare
Publication Year 2025
Rights https://creativecommons.org/licenses/by/4.0/
OpenAccess true
Contact researchdatarepository(at)ucl.ac.uk
Representation
Language English
Resource Type Dataset
Discipline History; Humanities