Post-OCR correction training dataset sPeriodika-postOCR

Dataset

The post-OCR correction dataset consists of paragraphs of text, each at least 100 characters long, extracted from documents randomly sampled from the sPeriodika dataset (http://hdl.handle.net/11356/1881) of Slovenian historical periodicals. Five paragraphs were randomly sampled from each document. Paragraphs longer than 500 characters were trimmed to that length. The correction was performed by a single human annotator with access to the scan of the original document. Of the original collection of 450 paragraphs, 41 were discarded because they contained non-running text or because the OCR quality was very poor.
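
The per-document sampling described above can be sketched roughly as follows. This is only an illustration, not the authors' actual script: the function name `sample_paragraphs`, the constants, and the order of the steps (filter by length first, then sample, then trim) are assumptions made for the example.

```python
import random

MIN_LEN = 100  # minimum paragraph length stated in the description
MAX_LEN = 500  # paragraphs longer than this are trimmed
PER_DOC = 5    # paragraphs sampled from each document


def sample_paragraphs(paragraphs, rng, per_doc=PER_DOC):
    """Sample up to `per_doc` sufficiently long paragraphs and trim them.

    Hypothetical helper: the real pipeline may order these steps differently.
    """
    # Keep only paragraphs that meet the minimum length.
    eligible = [p for p in paragraphs if len(p) >= MIN_LEN]
    # Sample without replacement; take fewer if the document is short.
    chosen = rng.sample(eligible, min(per_doc, len(eligible)))
    # Trim every sampled paragraph to at most MAX_LEN characters.
    return [p[:MAX_LEN] for p in chosen]
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the sample reproducible.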

The metadata in the CSV dataset are the following:
- URN of the document
- link to the original PDF in dLib
- name of the periodical
- publisher of the periodical
- publication date
- original text
- corrected text
- line offset (zero-indexed)
- character length of the paragraph (trimmed to max. 500 characters)
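
As a minimal sketch of how the original and corrected text fields might be compared, the snippet below counts the rows in which the annotator changed the text. The column names `original_text` and `corrected_text` are hypothetical, since the actual header names are not given here, and the two-row sample is a made-up stand-in for the real file.

```python
import csv
import io

# Hypothetical column names; check the header row of the actual CSV file.
ORIGINAL_COL = "original_text"
CORRECTED_COL = "corrected_text"


def count_changed(csv_file, original_col=ORIGINAL_COL, corrected_col=CORRECTED_COL):
    """Return (total rows, rows where the corrected text differs from the original)."""
    total = changed = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if row[original_col] != row[corrected_col]:
            changed += 1
    return total, changed


# Made-up stand-in for the real file, only to show the call.
demo = io.StringIO(
    "original_text,corrected_text\n"
    "Slovenfki narod,Slovenski narod\n"
    "Ljubljana,Ljubljana\n"
)
print(count_changed(demo))
```

For the downloaded file, the same function can be called on a file object opened with `open(path, newline="", encoding="utf-8")`, after adjusting the column names to match the header.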

Identifier
PID	http://hdl.handle.net/11356/1907
Related Identifier	https://www.inz.si/en/dihur/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1907

Provenance
Creator	Dobranić, Filip; Konda, Karin; Evkoski, Bojan; Ljubešić, Nikola
Publisher	Institute of Contemporary History
Publication Year	2024
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; text/csv; downloadable_files_count: 1
Discipline	Linguistics