CantusCorpus v1.0

Dataset

CantusCorpus 1.0 is a large dataset of Gregorian chant intended for computational research.

The dataset consists of all chants that are accessible through the Cantus Index federated search interface, combining data from 10 individual chant databases. These are primarily catalogue records: which chants appear in which manuscripts. Multiple instances of a chant across different manuscripts can be identified through the Cantus ID mechanism, established over the long history of the Cantus Database. Thus, CantusCorpus 1.0 has two components: chant records (chants.csv) and source records (sources.csv), the sources being overwhelmingly manuscripts.

CantusCorpus lies inherently downstream of the Cantus Database and the whole Cantus Index network of compatible chant databases: we do not revisit anyone's editorial decisions. The value of this dataset is that the sum of all the editorial decisions made over the databases' decades of existence is made available as a dataset for computational research.
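As a rough illustration of how the Cantus ID links chant records across manuscripts, the following sketch groups the rows of a chants file by Cantus ID and collects the sources in which each chant occurs. The column names `cantus_id` and `srclink` are assumptions, as are the made-up sample rows; check the header of chants.csv for the actual field names before relying on them.

```python
import csv
import io
from collections import defaultdict


def sources_per_cantus_id(chants_file, id_col="cantus_id", source_col="srclink"):
    """Map each Cantus ID to the set of source identifiers in which it occurs.

    `id_col` and `source_col` are assumed column names, not confirmed ones.
    """
    sources = defaultdict(set)
    for row in csv.DictReader(chants_file):
        cantus_id = (row.get(id_col) or "").strip()
        source = (row.get(source_col) or "").strip()
        # Skip records that lack either identifier.
        if cantus_id and source:
            sources[cantus_id].add(source)
    return dict(sources)


# Tiny in-memory stand-in for chants.csv (invented rows, for illustration only).
sample = io.StringIO(
    "cantus_id,srclink,incipit\n"
    "001234,source-A,Example incipit one\n"
    "001234,source-B,Example incipit one\n"
    "005678,source-A,Example incipit two\n"
)

by_id = sources_per_cantus_id(sample)

# Chants attested in more than one source.
shared = {cid: srcs for cid, srcs in by_id.items() if len(srcs) > 1}
print(shared)
```

With the real data, the same function could be called on a file opened with `open("chants.csv", newline="", encoding="utf-8")`. Note that Cantus IDs are read as strings here, so any leading zeros are preserved.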

The PyCantus library (https://github.com/dact-chant/PyCantus) then makes handling this dataset (almost) easy.

The accompanying source code (CantusCorpus-1.0.zip) contains a subdirectory with code and documentation for this particular version of CantusCorpus (v1.0). We expect to re-collect the dataset annually, as the Cantus network grows by tens of thousands of chant records each year.

Identifier
PID	http://hdl.handle.net/11234/1-6041
Related Identifier	https://ufal.mff.cuni.cz/grants/dact
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-6041
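
The Metadata Access URL above is an OAI-PMH GetRecord request. A minimal sketch of how such a URL is assembled from the handle is given below; the helper name `oai_get_record_url` is hypothetical, and the endpoint and identifier pattern are simply taken from the URL listed in this record.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the Metadata Access field of this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"


def oai_get_record_url(handle_suffix, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL (hypothetical helper, for illustration)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:lindat.mff.cuni.cz:{handle_suffix}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL listed above.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = oai_get_record_url("11234/1-6041")
print(url)
```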

Provenance
Creator	Anna Dvořáková; Debra Lacoste; Jan Hajič jr.
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2025
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); http://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Latin
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; text/csv; downloadable_files_count: 3
Discipline	Linguistics