This is a meta-repository containing all corpora published and curated by the Digital and Cognitive Musicology Lab Lausanne. Twelve corpora are publicly available; more are to follow.
- ABC
- The Annotated Beethoven Corpus containing L. v. Beethoven's string quartets.
- beethoven_piano_sonatas
- Ludwig van Beethoven - Piano Sonatas [DOI][ZIP]
- corelli
- Arcangelo Corelli's trio sonatas opp. 1, 3 and 4.
- chopin_mazurkas
- Frédéric Chopin - Mazurkas [DOI][ZIP]
- debussy_suite_bergamasque
- Claude Debussy - Suite Bergamasque [DOI][ZIP]
- dvorak_silhouettes
- Antonín Dvořák - Silhouettes [DOI][ZIP]
- grieg_lyrical_pieces
- Edvard Grieg - Lyric Pieces [DOI][ZIP]
- liszt_pelerinage
- Franz Liszt - Années de Pèlerinage [DOI][ZIP]
- medtner_tales
- Nikolai Medtner - Tales [DOI][ZIP]
- mozart_piano_sonatas
- All piano sonatas by W.A. Mozart.
- schumann_kinderszenen
- Robert Schumann - Kinderszenen [DOI][ZIP]
- tchaikovsky_seasons
- Pyotr Tchaikovsky - The Seasons [DOI][ZIP]
At the heart of every subcorpus is a folder called MS3 containing a set of annotated music scores in the MuseScore file format .mscx. In order to display the files you need to download the data to your computer and open them with MuseScore 3. For example, the beginning of the file ABC/MS3/n08op59-2_01.mscx looks like this:
In addition to the annotated scores in the MS3 folder, the following folders contain the same information in a tabular format:
- notes: TSV files representing one note per row
- measures: TSV files representing one measure per row
- harmonies: TSV files representing one harmony label per row
- chords: TSV files where each row represents a set of notes with the same onset and duration, appearing in the same notational layer. Columns represent various dynamics, articulation signs, staff texts, figured bass, etc.
The TSV files (tab-separated values) can be opened with any modern data processor, programming language, or spreadsheet, for example with LibreOffice Calc. They were created with the MuseScore parser ms3 which can be used to extract other information from MuseScore files, too, such as articulation, lyrics, or rests. Its documentation includes information on what the columns in the above-mentioned TSV files contain.
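As a minimal sketch of reading one of these files from a programming language, the following uses only Python's standard library. The helper name `read_tsv` and the example path are assumptions made for illustration and are not part of ms3; the actual file names and column names depend on the corpus and facet and are described in the ms3 documentation.

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of dicts, one dict per row.

    Keys are taken from the header row; all values are returned as strings.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Hypothetical path: adjust it to a TSV file that exists in your local copy.
    example = Path("ABC/notes/n08op59-2_01.tsv")
    if example.exists():
        rows = read_tsv(example)
        columns = list(rows[0].keys()) if rows else []
        print(f"{len(rows)} rows, columns: {columns}")
```

Because every value comes back as a string, numeric columns need to be converted explicitly before any computation.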
The harmonic analysis in the above example follows the DCML harmonic annotation standard. The labels were entered into the scores by professional music theorists.
Since the second half of 2023, all releases of DCML corpora have been accompanied by frictionless datapackages. The datapackage contains the following files:
- dcml_corpora.zip, a ZIP file containing one TSV file per facet, each corresponding to a concatenation of the TSV files in the respective folders of all corpora, that is:
  * dcml_corpora.chords.tsv
  * dcml_corpora.expanded.tsv
  * dcml_corpora.measures.tsv
  * dcml_corpora.metadata.tsv (concatenation of a single file)
  * dcml_corpora.notes.tsv
- dcml_corpora.datapackage.json, the package descriptor.
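The concatenated TSV files can also be read straight from the ZIP archive without unpacking it. The sketch below uses only Python's standard library; the function name `read_facet_from_zip` is hypothetical, and whether the TSV files sit at the root of the archive is an assumption, so it is worth checking the member names with `namelist()` first.

```python
import csv
import io
import os
import zipfile


def read_facet_from_zip(zip_path, member_name):
    """Return the rows of one TSV member of a ZIP archive as a list of dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as raw:
            # Wrap the binary stream so that csv can read it as UTF-8 text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))


if __name__ == "__main__":
    # Assumes the archive has been downloaded to the working directory.
    if os.path.exists("dcml_corpora.zip"):
        with zipfile.ZipFile("dcml_corpora.zip") as archive:
            # Inspect the actual member names before choosing one to read.
            print(archive.namelist())
```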
If one has the frictionless framework installed and has downloaded both files, one can use the descriptor to validate the package using the command:
```bash
frictionless validate dcml_corpora.datapackage.json
```
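Without the frictionless framework, the descriptor can still be inspected as plain JSON. The following sketch lists the resources it declares; it relies only on the `resources`, `name` and `path` keys defined by the Data Package specification, performs no validation, and the function name `list_resources` is hypothetical.

```python
import json
import os


def list_resources(descriptor_path):
    """Return (name, path) pairs for the resources declared in a descriptor."""
    with open(descriptor_path, encoding="utf-8") as handle:
        descriptor = json.load(handle)
    # Missing keys yield None instead of raising, since this is not a validator.
    return [
        (resource.get("name"), resource.get("path"))
        for resource in descriptor.get("resources", [])
    ]


if __name__ == "__main__":
    # Assumes the descriptor has been downloaded to the working directory.
    if os.path.exists("dcml_corpora.datapackage.json"):
        for name, path in list_resources("dcml_corpora.datapackage.json"):
            print(name, path)
```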
This repository contains submodules. You can clone it together with all submodules using this command:
git clone --recurse-submodules -j8 https://github.com/DCMLab/dcml_corpora.git
If you are unfamiliar with Git, you can download the corpora individually as ZIP files. Click on the respective folder above (e.g. ABC @ <commit>) and click on (the green button) Code -> Download ZIP.