CatCoLA - Catalan Corpus of Linguistic Acceptability

We introduce CatCoLA, the Catalan Corpus of Linguistic Acceptability, which will contribute to the Catalan Language Understanding Benchmark (CLUB) for assessing and comparing the capabilities of language models (LMs) trained on Catalan texts. CatCoLA follows the design of the English CoLA and supports the task of classifying sentences as acceptable or not. It consists of 10,443 sentences and their acceptability judgements as found in well-known Catalan reference grammars. Additionally, following previous practice, every sentence has been annotated with the class of linguistic phenomenon it exemplifies. CatCoLA is released under a CC BY-NC-SA 4.0 licence and is freely available, except for the test data, which is withheld to avoid contamination. To request the test data, please write to nuria.bel@upf.edu.

Identifier
DOI https://doi.org/10.34810/data1393
Metadata Access https://dataverse.csuc.cat/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34810/data1393
Provenance
Creator BEL, NURIA; Punsola, Marta; Ruiz-Fernández, Valle
Publisher CORA.Repositori de Dades de Recerca
Contributor BEL, NURIA; Institut de Lingüística Aplicada; Universitat Pompeu Fabra
Publication Year 2024
Funding Reference Ministerio de Ciencia, Innovación y Universidades and Agencia Estatal de Investigación (Spain) PID2019-104512GB-I00
Rights CC BY-SA 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-sa/4.0
OpenAccess true
Contact BEL, NURIA (Universitat Pompeu Fabra)
Representation
Resource Type Machine-readable text; Dataset
Format text/tab-separated-values; text/plain
Size 842727; 10674
Version 2.0
Discipline Humanities; Linguistics
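
The record lists the data as tab-separated values, with an acceptability judgement and a phenomenon class for each sentence. A minimal sketch of reading such a file and counting labels is shown below. The column names (`sentence`, `label`, `phenomenon`), the 0/1 label encoding, and the sample rows are all assumptions for illustration; the actual CatCoLA files may use a different layout, so check the header of the downloaded file first.

```python
import csv
import io
from collections import Counter

# Hypothetical placeholder rows in a CoLA-like layout.
# These are NOT real CatCoLA entries; column names and label
# encoding (1 = acceptable, 0 = unacceptable) are assumed.
SAMPLE_TSV = (
    "sentence\tlabel\tphenomenon\n"
    "placeholder sentence A\t1\tagreement\n"
    "placeholder sentence B\t0\tagreement\n"
    "placeholder sentence C\t1\tword_order\n"
)


def load_rows(handle):
    """Read tab-separated rows into dicts, converting the label to int."""
    # QUOTE_NONE keeps any quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "sentence": row["sentence"],
            "label": int(row["label"]),
            "phenomenon": row["phenomenon"],
        }
        for row in reader
    ]


def label_counts(rows):
    """Count how many sentences carry each acceptability label."""
    return Counter(row["label"] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_TSV))
counts = label_counts(rows)
print(counts)
```

To use this with a downloaded file, replace the `io.StringIO(SAMPLE_TSV)` argument with an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file was saved.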