The goal of this experiment is to train a model that classifies musical instruments from audio samples.
We first limit ourselves to fixed-length samples of a single harmonic tone played by synthesized MIDI instruments.
data/midi-instruments.csv
- "General MIDI Level 1 Instrument Patch Map"
- source: MIDI specification
- 128 MIDI instruments grouped into 16 families, plus boolean tonal and harmonic attributes
- columns:
- id - numeric instrument identifier (1-128)
- name - instrument identifier
- desc - human-readable instrument name
- family_name - instrument family identifier
- family_desc - human-readable instrument family name
- tonal - indicates a tonal instrument (True/False)
- harmonic - indicates a harmonic instrument (True/False)
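For example, the table can be loaded and filtered with pandas (a minimal sketch; only the file path and column names above are taken from this document):
import pandas as pd

instruments = pd.read_csv('data/midi-instruments.csv')
# e.g. select only the tonal, harmonic instruments
tonal_harmonic = instruments[instruments['tonal'] & instruments['harmonic']]
print(tonal_harmonic[['id', 'name', 'family_name']].head())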
data/prepared/single-notes-2000
- 2000 audio samples, each containing a single note from a selected set of instruments
- generated by:
# code commit: f97a940
# original command (time spent: 2m24.281s):
DATA_DIR=data/prepared/single-notes-2000
time python generate_audio_samples.py -c 2000 -s 42 -o ${DATA_DIR} -f flac
# time: ~2:30 min.
time python extract_features.py ${DATA_DIR}
We'll use FluidSynth, together with some sound fonts, to synthesize audio from MIDI:
./install_fluidsynth_with_soundfonts_osx.sh
Using a sound font, we can synthesize MIDI files into audio files in various formats (e.g. WAV, FLAC). The output format is determined by the file extension.
Either use fluidsynth directly:
fluidsynth -ni sound_font.sf2 input.mid -F output.wav
Or use a wrapper with a simpler interface:
# convert MIDI to WAV (with a default sound font)
python fluidsynth.py input.mid output.wav
# convert MIDI to FLAC
python fluidsynth.py input.mid output.flac
# playback MIDI
python fluidsynth.py input.mid
# convert MIDI to audio with a specific sound font
python fluidsynth.py -s sound_font.sf2 input.mid output.flac
The default sound font for fluidsynth.py is stored in ~/.fluidsynth/default_sound_font.
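A minimal sketch of how such a default can be resolved, assuming the file simply contains the path to an .sf2 file (the actual logic lives in fluidsynth.py):
from pathlib import Path

def default_sound_font():
    # Assumption: ~/.fluidsynth/default_sound_font holds a single line
    # with the path to the default .sf2 sound font.
    config = Path.home() / '.fluidsynth' / 'default_sound_font'
    return config.read_text().strip()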
Just an example of how to generate a MIDI file with a sequence of chords using the music21 library.
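A minimal sketch of such a script (the particular chord progression and output path are just illustrations):
from music21 import chord, stream

# build a short chord progression and write it out as a MIDI file
s = stream.Stream()
for pitches in [('C4', 'E4', 'G4'), ('F4', 'A4', 'C5'), ('G4', 'B4', 'D5')]:
    c = chord.Chord(pitches)
    c.duration.quarterLength = 2
    s.append(c)
s.write('midi', fp='chords.mid')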
A script to generate a single-note MIDI file, where several parameters (pitch, volume, duration, instrument) can be specified.
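A minimal sketch of the idea, again using music21 (the function name and default parameter values are placeholders, not the actual script):
from music21 import instrument, note, stream

def make_single_note(midi_pitch=69, velocity=100, duration=2.0, midi_program=0):
    s = stream.Stream()
    inst = instrument.Instrument()
    inst.midiProgram = midi_program  # General MIDI program number
    s.insert(0, inst)
    n = note.Note()
    n.pitch.midi = midi_pitch
    n.volume.velocity = velocity
    n.duration.quarterLength = duration
    s.append(n)
    return s

make_single_note().write('midi', fp='single_note.mid')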
Generate a dataset:
# time spent: 2m24.281s:
time python generate_audio_samples.py -c 2000 -s 42 -o data/working/random-notes-2000 -f flac
Then load it (2000 samples, each 2 seconds long at a 44100 Hz sampling rate):
>>> dataset = SingleToneDataset('data/working/random-notes-2000')
>>> dataset.samples.shape
(2000, 88200)
Computes reassigned chromagrams (log-frequency, log-amplitude spectrograms with bins aligned to pitches) from the audio.
The output for each audio clip is a matrix of shape (time_frames, chroma_bins). Stacked together, the clips form an array of shape (data_points, time_frames, chroma_bins).
For the 2-second clips, each chromagram has shape (44, 115).
Params:
- block_size: 4096
- hop_size: 2048
- bin range: [-48, 67]
- MIDI notes: [21, 136] (even though MIDI note numbers only go up to 127)
- pitches: [A0, E10]
- bin division: 1
- base frequency: 440 Hz
The bins are numbered from 0 = A4 = MIDI 69. One bin step is one semitone.
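A minimal sketch of the bin-to-pitch mapping implied by these parameters (bin_division = 1 means one bin per semitone):
BASE_FREQ = 440.0  # A4
BIN_DIVISION = 1   # bins per semitone

def bin_to_midi(bin_index):
    # bin 0 corresponds to A4 = MIDI 69
    return 69 + bin_index / BIN_DIVISION

def bin_to_freq(bin_index):
    return BASE_FREQ * 2 ** (bin_index / (12 * BIN_DIVISION))

# the bin range [-48, 67] spans MIDI 21 (A0) to MIDI 136 ("E10")
assert bin_to_midi(-48) == 21
assert bin_to_midi(67) == 136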
Prepares the data for training (train/validation splits, feature scaling, etc.).
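A minimal sketch of such preparation using scikit-learn (the split ratio and choice of scaler are assumptions, not the actual preparation code):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def prepare(features, targets, seed=42):
    # hold out a validation set
    X_train, X_valid, y_train, y_valid = train_test_split(
        features, targets, test_size=0.2, random_state=seed)
    # scale chromagram values to [0, 1], fitted on the training set only
    scaler = MinMaxScaler()
    train_shape, valid_shape = X_train.shape, X_valid.shape
    X_train = scaler.fit_transform(X_train.reshape(len(X_train), -1)).reshape(train_shape)
    X_valid = scaler.transform(X_valid.reshape(len(X_valid), -1)).reshape(valid_shape)
    return X_train, X_valid, y_train, y_valid, scaler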
Trains a model on the single-notes-2000 dataset.
Currently the model is a neural net with a few convolution layers followed by a few dense layers, implemented in Keras.
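A minimal sketch of such an architecture in Keras (the layer sizes are illustrative, not the actual model; the input shape matches the (44, 115) chromagrams):
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(44, 115, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(16, activation='softmax'),  # one output per instrument family
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])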
Computes and plots various evaluation data from the training process.
- learning curves - is the model overfitting or underfitting?
- confusion matrices - what are the errors between target classes?
- how is the classification error distributed with respect to pitch?
Normally this is called from within train.py, but you can call it again afterwards.
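A minimal sketch of the kinds of plots involved, using matplotlib and scikit-learn (history, y_true, and y_pred are assumed to come from the training run):
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# learning curves from a Keras History object
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.legend()
plt.savefig('learning_curves.png')

# confusion matrix between true and predicted classes
cm = confusion_matrix(y_true, y_pred)
plt.matshow(cm)
plt.savefig('confusion_matrix.png')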
Listen to the misclassified audio samples.
See the best models so far and a ranking of the latest models.
Predicts instrument family from a single audio file using a trained model.
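A minimal sketch of the prediction step (compute_chromagram stands in for the actual feature extraction, and the model path is a placeholder):
import numpy as np
from keras.models import load_model

model = load_model('model.h5')  # placeholder path to a trained model
x = compute_chromagram('input.flac')  # hypothetical feature-extraction helper
# add batch and channel dimensions to the (time_frames, chroma_bins) matrix
probs = model.predict(x[np.newaxis, :, :, np.newaxis])[0]
print('predicted family index:', probs.argmax())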
classify_instrument.sh
- a wrapper with the path to a model fixed
predict_webapp.py
- a web interface for prediction
One of the best models so far has a validation error of around 5% and an AUC of about 0.99. It is composed of 8 convolution layers plus a softmax layer and has around 220k parameters. Training takes around 2 minutes on a GPU.