CantusCorpus v1.0

CantusCorpus 1.0 is a large dataset of Gregorian chant intended for computational research.

The dataset consists of all chants that are accessible through the Cantus Index federated search interface, combining data from 10 individual chant databases. These are primarily catalogue records: which chants appear in which manuscripts. What allows us to identify multiple instances of a chant across different manuscripts is the Cantus ID mechanism, established over the long history of the Cantus Database. Thus, CantusCorpus 1.0 has two components: chant records (chants.csv) and source records (sources.csv), the sources being overwhelmingly manuscripts. CantusCorpus lies inherently downstream of the Cantus Database and the whole Cantus Index network of compatible chant databases: we do not revisit anyone's editorial decisions. However, the value of this dataset is that the sum of all the editorial decisions made over the databases' decades of existence is being made available as a dataset for computational research.
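As an illustration of how the Cantus ID ties chant records to sources, the following minimal sketch groups chant records by Cantus ID and collects the sources in which each chant appears. It uses only the Python standard library and a tiny made-up sample in place of chants.csv. The column names cantus_id and srclink, and all sample values, are assumptions made for this example; check them against the header row of the actual files.

```python
import csv
import io
from collections import defaultdict

# Assumed column names (not confirmed by this record); adjust them to
# match the header row of the real chants.csv.
CANTUS_ID_COL = "cantus_id"
SOURCE_COL = "srclink"


def sources_per_cantus_id(chants_file):
    """Map each Cantus ID to the set of source identifiers it appears in."""
    index = defaultdict(set)
    for row in csv.DictReader(chants_file):
        cantus_id = (row.get(CANTUS_ID_COL) or "").strip()
        source = (row.get(SOURCE_COL) or "").strip()
        # Skip records that lack either identifier.
        if cantus_id and source:
            index[cantus_id].add(source)
    return dict(index)


# Made-up sample rows standing in for chants.csv (hypothetical values).
SAMPLE_CHANTS = """cantus_id,srclink,incipit
001234,source-A,Example incipit one
001234,source-B,Example incipit one
005678,source-A,Example incipit two
"""

index = sources_per_cantus_id(io.StringIO(SAMPLE_CHANTS))
for cantus_id, sources in sorted(index.items()):
    print(cantus_id, len(sources), sorted(sources))
```

To run the same function on the real data, pass it a file handle such as open("chants.csv", newline="", encoding="utf-8") instead of the in-memory sample.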

The PyCantus library (https://github.com/dact-chant/PyCantus) then makes handling this dataset (almost) easy.

The accompanying source code (CantusCorpus-1.0.zip) contains a subdirectory with code and documentation for this particular version of CantusCorpus (v1.0). We expect to re-collect the dataset annually, as the Cantus network grows by tens of thousands of chant records each year.

Identifier
PID http://hdl.handle.net/11234/1-6041
Related Identifier https://ufal.mff.cuni.cz/grants/dact
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-6041
Provenance
Creator Anna Dvořáková; Debra Lacoste; Jan Hajič jr.
Publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year 2025
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); http://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Latin
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; text/csv; downloadable_files_count: 3
Discipline Linguistics