CantusCorpus 1.0 is a large dataset of Gregorian chant intended for computational research.
The dataset consists of all chants accessible through the Cantus Index federated search interface, which combines data from 10 individual chant databases. These are primarily catalogue records: which chants appear in which manuscripts. What allows us to identify multiple instances of a chant across different manuscripts is the Cantus ID mechanism, established over the long history of the Cantus Database.
Thus, CantusCorpus 1.0 has two components: chant records (chants.csv) and source records (sources.csv), the sources being overwhelmingly manuscripts. CantusCorpus lies inherently downstream of the Cantus Database and the whole Cantus Index network of compatible chant databases: we do not revisit anyone's editorial decisions. However, the value of this dataset is that the sum of all the editorial decisions made over the databases' decades of existence is made available as a dataset for computational research.
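As a rough illustration of how the two components fit together, the sketch below groups chant records by Cantus ID and lists the sources in which each chant appears. It uses only the Python standard library and small in-memory stand-ins for chants.csv and sources.csv. The column names (cantus_id, srclink, incipit, title) and all rows are assumptions made up for this example; check the headers of the actual files before adapting it.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for chants.csv and sources.csv.
# ASSUMPTION: the column names and every row below are invented for
# illustration and may not match the real CantusCorpus files.
CHANTS_CSV = """cantus_id,srclink,incipit
A001,src-1,Incipit one
A001,src-2,Incipit one
B002,src-1,Incipit two
"""
SOURCES_CSV = """srclink,title
src-1,Manuscript X
src-2,Manuscript Y
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def sources_by_cantus_id(chants, sources):
    """Map each Cantus ID to the sorted titles of the sources containing it."""
    titles = {s["srclink"]: s["title"] for s in sources}
    grouped = defaultdict(set)
    for chant in chants:
        # Fall back to the raw source link if no source record matches.
        grouped[chant["cantus_id"]].add(
            titles.get(chant["srclink"], chant["srclink"])
        )
    return {cid: sorted(found) for cid, found in grouped.items()}


chants = read_rows(CHANTS_CSV)
sources = read_rows(SOURCES_CSV)
index = sources_by_cantus_id(chants, sources)
```

With the real files, the in-memory strings would be replaced by opening chants.csv and sources.csv (for example with open(path, newline="", encoding="utf-8")) and passing the file objects to csv.DictReader.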
The PyCantus library (https://github.com/dact-chant/PyCantus) then makes handling this dataset (almost) easy.
The accompanying source code archive (CantusCorpus-1.0.zip) contains a subdirectory with code and documentation for this particular version of CantusCorpus (v1.0). We expect to re-collect the dataset annually, as the Cantus network grows by tens of thousands of chant records each year.