Musical Instrument Classification

The goal of this experiment is to train a model that classifies the musical instrument from a sample of audio.

We first limit ourselves to fixed-length samples of a single harmonic tone played by synthesized MIDI instruments.

Try on Heroku

Datasets:

  • data/midi-instruments.csv
    • "General MIDI Level 1 Instrument Patch Map"
    • source: MIDI specification
    • 128 MIDI instruments, 16 families, and boolean tonal and harmonic attributes (see the loading sketch after this list)
    • columns:
      • id - numeric instrument identifier (1-128)
      • name - instrument identifier
      • desc - human-readable instrument name
      • family_name - instrument family identifier
      • family_desc - human-readable instrument family name
      • tonal - indicates a tonal instrument (True/False)
      • harmonic - indicates a harmonic instrument (True/False)
  • data/prepared/single-notes-2000
    • 2000 audio samples, each containing a single note from a selected set of instruments
    • generated by:
# code commit: f97a940
# original command (time spent: 2m24.281s):
DATA_DIR=data/prepared/single-notes-2000
time python generate_audio_samples.py -c 2000 -s 42 -o ${DATA_DIR} -f flac

# time: ~2:30 min.
time python extract_features.py ${DATA_DIR}
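
For illustration, the instrument metadata above can be loaded with pandas. This is just a sketch: pandas is not required by the project, and it assumes the columns listed above with the True/False values parsed as booleans.

# a minimal sketch, assuming the columns listed above and boolean parsing
import pandas as pd

instruments = pd.read_csv('data/midi-instruments.csv')

# number of instruments per family
print(instruments.groupby('family_name').size())

# keep only tonal, harmonic instruments
tonal_harmonic = instruments[instruments['tonal'] & instruments['harmonic']]
print(tonal_harmonic[['id', 'name', 'family_name']].head())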

MIDI to audio synthesis

We'll use FluidSynth for synthesizing audio from MIDI and some sound fonts:

Installation on OS X

./install_fluidsynth_with_soundfonts_osx.sh

Usage

We can synthesize MIDI files to audio files in various formats (e.g. WAV, FLAC) using sound fonts. The output format is determined by the file extension.

Either use fluidsynth directly:

fluidsynth -ni sound_font.sf2 input.mid -F output.wav

Or use a wrapper with a simpler interface:

# convert MIDI to WAV (with a default sound font)
python fluidsynth.py input.mid output.wav
# convert MIDI to FLAC
python fluidsynth.py input.mid output.flac

# playback MIDI
python fluidsynth.py input.mid

# convert MIDI to audio with a specific sound font
python fluidsynth.py -s sound_font.sf2 input.mid output.flac

The default sound font for fluidsynth.py is stored in ~/.fluidsynth/default_sound_font.
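
Under the hood, such a wrapper only needs to invoke the fluidsynth command line. A rough sketch follows (illustrative only; the actual fluidsynth.py may differ, and the file names are placeholders):

import subprocess

def midi_to_audio(midi_file, audio_file, sound_font='sound_font.sf2'):
    # render a MIDI file to audio via the fluidsynth CLI;
    # the output format is inferred from the file extension
    subprocess.check_call(
        ['fluidsynth', '-ni', sound_font, midi_file, '-F', audio_file, '-r', '44100'])

midi_to_audio('input.mid', 'output.flac')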

Generating Datasets

generate_chords.py

Just an example of how to generate a MIDI file with a sequence of chords using the music21 library.
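
A minimal sketch of the idea (illustrative; not necessarily identical to generate_chords.py):

from music21 import chord, stream

s = stream.Stream()
for pitches in [['C4', 'E4', 'G4'], ['F4', 'A4', 'C5'], ['G4', 'B4', 'D5']]:
    c = chord.Chord(pitches)
    c.quarterLength = 2  # each chord lasts two quarter notes
    s.append(c)

# write the chord sequence to a MIDI file
s.write('midi', fp='chords.mid')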

generate_audio_samples.py

A script that generates single-note audio samples via MIDI, where several parameters (such as pitch, volume, duration and instrument) can be specified.

Generate a dataset:

# time spent: 2m24.281s:
time python generate_audio_samples.py -c 2000 -s 42 -o data/working/random-notes-2000 -f flac

Then load it (2000 samples, each 2 seconds long at a 44100 Hz sampling rate):

>>> dataset = SingleToneDataset('data/working/random-notes-2000')
>>> dataset.samples.shape
(2000, 88200)
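
The loading step can be sketched roughly as follows (illustrative; the actual SingleToneDataset implementation may differ, and the flat directory of FLAC files is an assumption):

import glob
import numpy as np
import soundfile as sf

# read all FLAC clips into one (n_samples, n_frames) array
files = sorted(glob.glob('data/working/random-notes-2000/*.flac'))
samples = np.vstack([sf.read(f)[0] for f in files])
print(samples.shape)  # e.g. (2000, 88200) for 2 s at 44100 Hz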

extract_features.py

Computes reassigned chromagrams (log-frequency, log-amplitude spectrograms with bins aligned to pitches) from the audio.

The output for each audio clip is a matrix of shape (time_frames, chroma_bins). Together they form an array of shape (data_points, time_frames, chroma_bins).

Each 2-second clip yields a matrix of shape (44, 115).

Params:

  • block_size: 4096
  • hop_size: 2048
  • bin range: [-48, 67]
    • MIDI notes: [21, 136] (although MIDI note numbers only go up to 127)
    • pitches: [A0, E10]
  • bin division: 1
  • base frequency: 440 Hz

The bins are numbered so that bin 0 = A4 = MIDI 69. One bin step is one semitone.
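
Given that convention (440 Hz tuning, one bin per semitone), the mapping between bin index, MIDI note number and frequency is a small sketch:

# bin 0 = A4 = MIDI 69, one bin per semitone, 440 Hz tuning
def bin_to_midi(b):
    return b + 69

def bin_to_freq(b):
    return 440.0 * 2 ** (b / 12)

print(bin_to_midi(-48), bin_to_freq(-48))  # 21 (A0), 27.5 Hz
print(bin_to_midi(67), bin_to_freq(67))    # 136, roughly 21 kHz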

Training an instrument family classifier

prepare_training_data.py

Prepares data for training (splits, scaling, etc.).
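
The kind of preparation involved can be sketched with scikit-learn (a minimal sketch; the actual prepare_training_data.py may work differently, and the file paths are hypothetical):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.load('features.npy')  # hypothetical path, shape (data_points, time_frames, chroma_bins)
y = np.load('targets.npy')   # hypothetical path, instrument family labels

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# scale features using statistics computed on the training split only
mean, std = X_train.mean(), X_train.std()
X_train = (X_train - mean) / std
X_valid = (X_valid - mean) / std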

Training - train.py

Trains a model on the single-notes-2000 dataset.

Currently the model is a neural network with a few convolutional layers followed by a few dense layers, implemented in Keras.
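
An illustrative Keras model in the same spirit (a few convolutional layers followed by dense layers); the architecture in train.py may differ, and the input shape assumes the (44, 115) chromagrams described above:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(44, 115, 1)),           # (time_frames, chroma_bins, channels)
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(16, activation='softmax'),     # one output per instrument family
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()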

Evaluation

evaluate.py

Computes and plots various evaluation data from the training process.

  • learning curves - is the model overfitting or underfitting?
  • confusion matrices - what are the errors between target classes?
  • error distribution - how is the classification error distributed with respect to pitch?

Normally this is called from within train.py, but you can call it again afterwards.
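
The confusion matrix part can be sketched with scikit-learn (evaluate.py may produce its plots differently; the file paths are hypothetical):

import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical files with true and predicted family indices for the validation set
y_true = np.load('y_valid.npy')
y_pred = np.load('y_valid_pred.npy')

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows = true families, columns = predicted families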

inspect_errors.py

Listen to the misclassified audio samples.

leaderboard.py

Shows the best models so far and a ranking of the latest models.

Prediction - predict.py

Predicts instrument family from a single audio file using a trained model.

  • classify_instrument.sh - a wrapper with the path to a trained model fixed
  • predict_webapp.py - a web interface for prediction
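
The overall prediction flow (load a trained model, compute features for one clip, pick the most probable family) can be sketched as follows; predict.py's actual feature pipeline may differ, and the file paths are hypothetical:

import numpy as np
from tensorflow.keras.models import load_model

model = load_model('model.h5')             # hypothetical path to a trained model

# features for one clip, shaped like the training data: (time_frames, chroma_bins),
# assumed to be produced by the feature extraction step described above
features = np.load('clip_chromagram.npy')
probs = model.predict(features[np.newaxis, ..., np.newaxis])[0]

family_index = int(probs.argmax())
print(family_index, probs[family_index])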

Results

One of the best models so far has a validation error of around 5% and an AUC of about 0.99. It consists of 8 convolutional layers and one softmax layer, with around 220k parameters. Training on a GPU takes around 2 minutes.

Deployed demo app