Ground Truth Model for Pracalit for Sanskrit and Newar MSS 16th to 19th C.

Ground truth data for an OCR model. The dataset will be continually updated.

The model was originally trained on Transkribus as a PyLaia model, using ground truth data based on transcriptions into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used to create this model are the Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851) (mixed Newar and Sanskrit, dated 1561) and Vetālapañcaviṃśati (HS. Or. 6414) (Newar, dated 1675), Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) (Sanskrit, 18th century), and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) (Newar and Sanskrit, dated c. 1800).

The training was done on 441 pages and validation on 242 pages.

This model does not recognise spacing, except for large gaps (i.e. those left for pictures or string holes). Newar word divider markers may not be represented, or may be transcribed as virama. In general, the model is made for MSS written in scriptio continua and will transcribe them in scriptio continua into Pracalit Unicode.
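
Because the model outputs scriptio continua, a reference transcription that contains editorial word spaces would need its spacing removed before it can be compared with the model output. The following is a minimal sketch of that normalisation step, not part of the published workflow. The function name `to_scriptio_continua` and the romanised sample string are illustrative assumptions; real input would be Pracalit Unicode text. Note that this removes all spacing within a line, including any spaces the model may emit for large gaps.

```python
def to_scriptio_continua(text: str) -> str:
    """Remove all whitespace inside each line while keeping line breaks.

    Hypothetical helper for illustration only.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops spaces and tabs within the line.
    return "\n".join("".join(line.split()) for line in text.splitlines())


# Illustrative romanised sample; not taken from the dataset.
sample = "namo  buddhāya\nnamo dharmāya"
print(to_scriptio_continua(sample))
```

Line breaks are kept so that the normalised text can still be aligned line by line with the manuscript page.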

Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999.

Transkribus, 1.20.0–1.20.1

Identifier
DOI https://doi.org/10.11588/data/WI9184
Metadata Access https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/WI9184
Provenance
Creator O'Neill, Alexander
Publisher heiDATA
Contributor O'Neill, Alexander; Hill, Nathan
Publication Year 2022
Funding Reference AHRC UKRI AH/V011235/1
Rights CC BY 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess true
Contact O'Neill, Alexander (SOAS University of London)
Representation
Resource Type Ground Truth; Dataset
Format application/zip
Size 503049108 bytes
Version 1.0
Discipline Humanities
Spatial Coverage United Kingdom
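
The Metadata Access link listed above is an OAI-PMH `GetRecord` request. As a minimal sketch, the same URL can be assembled from its parts with Python's standard library. The helper name `get_record_url` is an illustrative assumption, and the sketch only builds the URL; it does not fetch or parse the record.

```python
from urllib.parse import urlencode

# Base endpoint taken from the Metadata Access link in this record.
BASE_URL = "https://heidata.uni-heidelberg.de/oai"


def get_record_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Build an OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # safe=":/" leaves the colon and slashes of the DOI unescaped,
    # matching the form of the URL shown in the record.
    return f"{BASE_URL}?{urlencode(params, safe=':/')}"


url = get_record_url("10.11588/data/WI9184")
print(url)
```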