PunkProse [software]

Punctuation marks support understandability and readability in written language. When speech is transcribed, its punctuation is influenced by two phenomena: (1) syntax and (2) prosody. We present a software architecture that makes it possible to train punctuation restoration models from any combination of lexical, morphosyntactic, prosodic and acoustic features. The architecture is language-independent and operates on word-segmented data. A dataset compiled from English TED talks is available at http://hdl.handle.net/10230/33981

This software is stored and maintained in a GitHub repository, where detailed usage instructions are also provided: https://github.com/alpoktem/punkProse
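
As a rough illustration of what "any combination of features" over word-segmented data can mean, the sketch below builds per-word training examples from a selectable subset of features. It is not taken from punkProse: the record fields (word, pos, pause_after, mean_f0, punct_after), the sample values and the function build_examples are all hypothetical, chosen only to show the idea.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical word-level records: one entry per word, plus the punctuation
# mark that follows it (empty string = none). Field names and values are
# illustrative assumptions, not the punkProse data format.
WORDS: List[Dict[str, object]] = [
    {"word": "hello", "pos": "UH", "pause_after": 0.35, "mean_f0": 1.2, "punct_after": ","},
    {"word": "how", "pos": "WRB", "pause_after": 0.02, "mean_f0": 0.4, "punct_after": ""},
    {"word": "are", "pos": "VBP", "pause_after": 0.01, "mean_f0": 0.1, "punct_after": ""},
    {"word": "you", "pos": "PRP", "pause_after": 0.80, "mean_f0": -0.9, "punct_after": "?"},
]


def build_examples(
    words: Sequence[Dict[str, object]], feature_names: Sequence[str]
) -> Tuple[List[tuple], List[str]]:
    """Return one feature tuple and one punctuation label per word."""
    inputs = [tuple(w[name] for name in feature_names) for w in words]
    labels = [str(w["punct_after"]) for w in words]
    return inputs, labels


# The same data can feed a lexical-only model or a lexical + prosodic one.
lexical_only = build_examples(WORDS, ["word"])
lexical_prosodic = build_examples(WORDS, ["word", "pause_after", "mean_f0"])
```

Changing the list of feature names is the only step needed to switch between feature combinations in this sketch. The actual input format and training commands are documented in the repository.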

Identifier
DOI	https://doi.org/10.34810/data484
Metadata Access	https://dataverse.csuc.cat/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34810/data484

Provenance
Creator	Öktem, Alp
Publisher	CORA.Repositori de Dades de Recerca
Publication Year	2023
Funding Reference	European Commission 645012
Rights	Custom Dataset Terms; info:eu-repo/semantics/openAccess; https://dataverse.csuc.cat/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34810/data484
OpenAccess	true

Representation
Resource Type	Program source code; Dataset
Format	text/x-python; text/plain; charset=US-ASCII; application/octet-stream; text/markdown; text/plain; application/x-sh; text/csv; audio/vnd.wave
Size	7228; 1079; 14249; 16427; 1452; 7370; 3099; 3095; 383; 1712; 107; 7381; 356672; 1701; 96; 6406; 144406; 491; 38; 2945; 120872; 2104; 158; 9959; 347018; 3302; 238; 16080; 380042; 4320
Version	1.0
Discipline	Other