Ground Truth Model for Pracalit for Sanskrit and Newar MSS 16th to 19th C.

Dataset

Ground truth data for an OCR model. The dataset will be continually updated.

The model was originally trained on Transkribus as a PyLaia model, using ground truth data based on transcripts into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used are the Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851; mixed Newar and Sanskrit, dated 1561) and Vetālapañcaviṃśati (HS. Or. 6414; Newar, dated 1675), Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322; Sanskrit, 18th century), and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23; Newar and Sanskrit, c. 1800).

The training was done on 441 pages and validation on 242 pages.
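
As a quick check of the proportions implied by these figures, the following sketch computes the total page count and the validation share. The variable names are illustrative only and are not part of the dataset.

```python
# Page counts as stated in the dataset description.
train_pages = 441
validation_pages = 242

total_pages = train_pages + validation_pages
validation_share = validation_pages / total_pages

print(f"total pages: {total_pages}")
print(f"validation share: {validation_share:.1%}")
```

On these numbers, validation accounts for roughly 35% of the 683 pages.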

This model does not recognise spacing, except for large gaps (i.e. for pictures or string holes). Newar word divider markers may not be represented or may be transcribed as virama. In general, the model is made for MSS written in scriptio continua and will transcribe them as scriptio continua in Pracalit Unicode.
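
Because the output is scriptio continua, a reference transcription that contains spaces would need to be normalised before it is compared with model output. The following is a minimal sketch of such a normalisation, assuming that the Pracalit text is encoded in the Unicode Newa block (U+11400 to U+1147F). The function names are hypothetical and are not part of Transkribus or of this dataset.

```python
# Assumption: Pracalit Unicode text uses the Newa block, U+11400..U+1147F.
NEWA_BLOCK = range(0x11400, 0x11480)


def to_scriptio_continua(text: str) -> str:
    """Remove all whitespace so the text matches unspaced output."""
    return "".join(ch for ch in text if not ch.isspace())


def non_newa_characters(text: str) -> set[str]:
    """Return the non-whitespace characters that fall outside the Newa block."""
    return {ch for ch in text if not ch.isspace() and ord(ch) not in NEWA_BLOCK}


# Two Newa code points separated by a space (chosen purely for illustration).
sample = "\U0001140E \U00011423"
print(to_scriptio_continua(sample) == "\U0001140E\U00011423")
print(non_newa_characters(sample + "x"))
```

Since the model may still emit spaces for large gaps, applying the same whitespace removal to both the reference and the model output is one way to keep such a comparison consistent.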

Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999.

Transkribus, 1.20.0–1.20.1

Identifier
DOI	https://doi.org/10.11588/data/WI9184
Metadata Access	https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/WI9184
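
The metadata access link above is a query against the repository's OAI endpoint. As a minimal sketch, the same URL can be assembled from the DOI with Python's standard library; the function name and constants below are illustrative, and no network request is made.

```python
from urllib.parse import urlencode

# Endpoint and DOI taken from the identifier fields above.
OAI_ENDPOINT = "https://heidata.uni-heidelberg.de/oai"
DOI = "10.11588/data/WI9184"


def oai_get_record_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Build a GetRecord URL for a DOI (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL given above.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = oai_get_record_url(DOI)
print(url)
```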

Provenance
Creator	O'Neill, Alexander
Publisher	heiDATA
Contributor	O'Neill, Alexander; Hill, Nathan
Publication Year	2022
Funding Reference	AHRC UKRI AH/V011235/1
Rights	CC BY 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess	true
Contact	O'Neill, Alexander (SOAS University of London)

Representation
Resource Type	Ground Truth; Dataset
Format	application/zip
Size	503049108 bytes
Version	1.0
Discipline	Humanities
Spatial Coverage	United Kingdom