Automatic Catalan KWS Database for Projecte AINA

Dataset

A Catalan word database extracted automatically from transcribed speech corpora by forced alignment with the Montreal Forced Aligner (MFA). The source corpora are Mozilla Common Voice, ParlamentParla, and OpenSLR-69. The database can be used to train keyword spotting (KWS) models for home automation. MFA aligns each speech signal with its transcription at the phoneme level, which makes it possible to locate individual words in the audio.
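
As a rough illustration of how word-level alignments can be turned into keyword clips, the sketch below cuts keyword segments out of a waveform given a list of aligned word intervals. The WordInterval type, the cut_keyword_clips function, the padding parameter, and the example keywords are hypothetical names chosen for illustration. This is not the authors' extraction pipeline, and reading the aligner's actual output files is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordInterval:
    """One aligned word: its text plus start/end times in seconds (hypothetical type)."""
    word: str
    start: float
    end: float


def cut_keyword_clips(samples, sample_rate, intervals, keywords, padding=0.0):
    """Return (word, clip) pairs for every aligned interval whose word is a keyword.

    samples:     sequence of audio samples for one utterance
    sample_rate: samples per second
    intervals:   iterable of WordInterval for that utterance
    keywords:    set of lowercase words to keep
    padding:     optional context (seconds) added on each side of the word
    """
    clips = []
    for interval in intervals:
        word = interval.word.lower()
        if word not in keywords:
            continue
        # Convert times to sample indices and clamp them to the waveform bounds.
        first = max(0, int(round((interval.start - padding) * sample_rate)))
        last = min(len(samples), int(round((interval.end + padding) * sample_rate)))
        if last > first:
            clips.append((word, samples[first:last]))
    return clips
```

In this sketch the waveform is any indexable sequence, so the same function works on a plain list or on an array type that supports slicing.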

Two versions of the database have been created:

general: This version encompasses all data, providing a comprehensive dataset for various analyses and applications.

split: This version is divided into train, dev, and test sets to simplify training a keyword spotting model. The division is made by speaker, in proportions of 80%, 10%, and 10%, respectively.
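
A speaker-wise division of this kind can be sketched as follows. This is a minimal example under stated assumptions, not the procedure used to build the published split: it assumes the percentages apply to the number of speakers, and the function names, the seed, and the speaker identifiers are hypothetical.

```python
import random


def split_speakers(speaker_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign each distinct speaker to exactly one of train/dev/test.

    The ratios are applied to the count of distinct speakers (an assumption);
    any rounding remainder goes to the test partition.
    """
    speakers = sorted(set(speaker_ids))      # sort first so the result is reproducible
    random.Random(seed).shuffle(speakers)    # seeded shuffle, independent of input order
    n_train = int(len(speakers) * ratios[0])
    n_dev = int(len(speakers) * ratios[1])
    return {
        "train": set(speakers[:n_train]),
        "dev": set(speakers[n_train:n_train + n_dev]),
        "test": set(speakers[n_train + n_dev:]),
    }


def assign_utterances(utterances, partitions):
    """Group (utterance_id, speaker_id) pairs by the partition of their speaker."""
    grouped = {name: [] for name in partitions}
    for utterance_id, speaker_id in utterances:
        for name, speakers in partitions.items():
            if speaker_id in speakers:
                grouped[name].append(utterance_id)
                break
    return grouped
```

Splitting on speakers rather than on individual clips keeps every speaker's voice in a single partition, so dev and test results reflect performance on unseen speakers.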

Identifier
DOI	https://doi.org/10.34810/data1400
Metadata Access	https://dataverse.csuc.cat/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34810/data1400

Provenance
Creator	Sánchez, Alex; Huerta, Ivan
Publisher	CORA.Repositori de Dades de Recerca
Contributor	Sánchez, Alex; Fundació Privada i2CAT Internet i Innovació Digital a Catalunya
Publication Year	2024
Rights	CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess	true
Contact	Sánchez, Alex (i2CAT)

Representation
Resource Type	Textual data; Dataset
Format	application/zip; text/plain; charset=US-ASCII; text/markdown; text/tab-separated-values
Size (bytes)	2252186113; 2239479987; 25278; 12217; 11665931
Version	1.1
Discipline	Other