Automatic Catalan KWS Database for Projecte AINA

A Catalan word database extracted automatically from transcribed speech corpora by forced alignment with the Montreal Forced Aligner (MFA). The source corpora are Mozilla Common Voice, ParlamentParla, and OpenSLR-69. The database is intended for training keyword spotting models for home automation. MFA aligns each speech signal with its transcription at the phoneme level, which yields the time boundaries used to locate individual words.
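As a rough illustration of how alignment output can be turned into word clips, the sketch below converts word-level time intervals into sample ranges for a set of target keywords. It is not the authors' pipeline: the function name `word_clips`, the `(start_s, end_s, word)` tuple layout, the 16 kHz sample rate, and the example words are all assumptions made for this example.

```python
def word_clips(intervals, sample_rate, keywords, padding_s=0.0):
    """Return (word, start_sample, end_sample) for each aligned keyword.

    intervals: list of (start_s, end_s, word) tuples, times in seconds
               (hypothetical layout for forced-alignment output).
    keywords:  set of words to keep.
    padding_s: optional context, in seconds, added on both sides.
    """
    clips = []
    for start_s, end_s, word in intervals:
        if word not in keywords:
            continue
        # Convert seconds to sample indices; clamp the start at zero.
        start = max(0, int(round((start_s - padding_s) * sample_rate)))
        end = int(round((end_s + padding_s) * sample_rate))
        clips.append((word, start, end))
    return clips


# Invented example data: one utterance with three aligned words.
example_intervals = [(0.0, 0.5, "obre"), (0.5, 0.8, "la"), (0.8, 1.25, "llum")]
example_clips = word_clips(example_intervals, 16000, {"obre", "llum"})
```

The resulting sample ranges could then be used to slice the waveform of the utterance; how the clips are actually stored in this database is described in its own documentation files.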

Two versions of the database have been created:

general: This version contains all the data, for use in any analysis or application.

split: This version is divided into train, dev, and test sets to make it easier to train a keyword spotting model. The division is made by speaker, in proportions of 80%, 10%, and 10%.
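A speaker-wise split of this kind can be sketched as follows. This is a minimal example and not the procedure used to build the dataset: it assumes that the percentages refer to the number of speakers and that no speaker appears in more than one partition, and the function name `split_by_speaker`, the `(speaker_id, clip_path)` layout, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def split_by_speaker(clips, ratios=(0.8, 0.1, 0.1), seed=0):
    """Partition clips into train/dev/test so that speakers do not overlap.

    clips: iterable of (speaker_id, clip_path) pairs (hypothetical layout).
    ratios: fractions of speakers for train and dev; test gets the remainder.
    """
    by_speaker = defaultdict(list)
    for speaker_id, clip_path in clips:
        by_speaker[speaker_id].append(clip_path)

    # Shuffle speakers, not clips, so each speaker lands in one partition.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = int(len(speakers) * ratios[0])
    n_dev = int(len(speakers) * ratios[1])
    groups = {
        "train": speakers[:n_train],
        "dev": speakers[n_train:n_train + n_dev],
        "test": speakers[n_train + n_dev:],
    }
    return {
        name: [(s, p) for s in group for p in by_speaker[s]]
        for name, group in groups.items()
    }


# Invented example data: 20 speakers with 3 clips each.
example = [(f"spk{i:02d}", f"spk{i:02d}_{j}.wav") for i in range(20) for j in range(3)]
parts = split_by_speaker(example)
```

Splitting on speakers rather than on individual clips keeps the dev and test sets free of voices seen during training, so evaluation reflects performance on unseen speakers.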

Identifier
DOI https://doi.org/10.34810/data1400
Metadata Access https://dataverse.csuc.cat/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34810/data1400
Provenance
Creator Sánchez, Alex; Huerta, Ivan
Publisher CORA. Repositori de Dades de Recerca
Contributor Sánchez, Alex; Fundació Privada i2CAT Internet i Innovació Digital a Catalunya; Fundació Privada i2CAT Internet i Innovació
Publication Year 2024
Rights CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess true
Contact Sánchez, Alex (i2CAT)
Representation
Resource Type Textual data; Dataset
Format application/zip; text/plain; charset=US-ASCII; text/markdown; text/tab-separated-values
Size 2252186113; 2239479987; 25278; 12217; 11665931
Version 1.1
Discipline Other