Jamendo datasets currently carry song-level tags covering every theme and mood that appears anywhere in a track. The problem is that segments of a song can have different themes, instruments, and genres. As things stand we only tag a single section of the work and apply that label to the entire song.
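A minimal sketch of what segment-level tagging could look like, and how today's song-level label relates to it. The `Segment` type, field names, and tag values here are hypothetical, chosen just to illustrate the information loss:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_s: float  # segment start time, seconds
    end_s: float    # segment end time, seconds
    tags: set       # themes / instruments / genres heard in this window

def song_level_tags(segments):
    """Union of all segment tags -- the coarse whole-song label set we have today."""
    tags = set()
    for seg in segments:
        tags |= seg.tags
    return tags

# A song whose halves sound nothing alike:
segments = [
    Segment(0.0, 30.0, {"rock", "guitar"}),
    Segment(30.0, 60.0, {"electronic", "synth"}),
]
# The song-level label mixes both halves together and loses the per-segment detail.
print(song_level_tags(segments))
```

The flattened label says the song is simultaneously rock and electronic, which is exactly the ambiguity per-segment tags would resolve.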
Genre is a loosely defined concept. Some genres have key features, and musicologists can label and name said features.
There are several models and papers that tackle the problem of genre tagging.
- The CC music model is decent, but its labelling accuracy is not good enough to be part of our process; the error rate is too high. model / demo
- A Kaggle competition represents music using high-level, human-labelled features. Things like danceability, acousticness, energy, and instrumentalness are not necessarily things we can derive from a spectrogram.
Mood tagging for music is a subset of a class of problems called Multimodal Emotion Recognition. In my search I was able to find many datasets and papers, but very few fully working models with decent performance.
- The lileonardo 3-average model provides code to train a model but does not ship a pretrained one.
- A collection of datasets for Musical Emotion Recognition that may be useful in the future.
- A paper that uses Thayer's model of emotion to classify the mood of a song.
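Thayer's model places moods on a two-dimensional valence-arousal plane, so classification reduces to deciding which quadrant a song's (valence, arousal) estimate falls into. A minimal sketch; the quadrant labels and the assumption that both values are normalized to [-1, 1] are mine, not from the paper:

```python
def thayer_quadrant(valence, arousal):
    """Map a (valence, arousal) point to one of Thayer's four mood quadrants.

    Assumes both inputs are normalized to [-1, 1]. The quadrant names
    below are illustrative labels for the four regions.
    """
    if arousal >= 0:
        return "exuberant" if valence >= 0 else "anxious"
    return "contentment" if valence >= 0 else "depression"

print(thayer_quadrant(0.7, 0.5))    # high arousal, positive valence -> exuberant
print(thayer_quadrant(-0.3, -0.8))  # low arousal, negative valence -> depression
```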
- A paper on mid-level features that could be a promising lead if we decide to generate human-labelled data. It defines features that are not as high-level as acousticness, but not as low-level as spectral centroid.
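For contrast with the human-labelled high-level features above, a low-level feature like spectral centroid can be computed directly from the signal with no labelling at all. A minimal NumPy sketch (the test tone and sample rate are arbitrary choices):

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """Low-level feature: the magnitude-weighted mean frequency of one audio frame."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)  # one second of a pure 440 Hz tone

# For a pure tone the centroid sits at the tone's frequency, ~440 Hz.
print(round(spectral_centroid(tone, sr)))
```

Features like danceability sit far above this, and the paper's mid-level features aim for the space in between.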
- A service called Musixmatch, a potential low-cost commercial service we could use to label our segments.
See feature_engineering.ipynb for more details on that.
During the spike several models were explored. Of note were the speechbrain and JMLA models.
One model explored was the speechbrain music emotion detection model. It showed a lot of promise but only captured four moods: happy, angry, neutral, and sad.
The other model, JMLA, read in a song file and generated a text description including mood, genre, and theme. Overall it was pretty weak and did not manage to correctly label even pop songs with simple structure. I ran these models on 10-second snippets from popular music to get a feel for their efficacy.
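Cutting the 10-second snippets is straightforward sample-rate arithmetic. A minimal sketch assuming a mono NumPy waveform; the function name and the drop-the-tail policy are my choices:

```python
import numpy as np

def ten_second_snippets(samples, sample_rate, snippet_s=10):
    """Split a mono waveform into non-overlapping fixed-length snippets,
    dropping any trailing partial window."""
    hop = snippet_s * sample_rate
    n = len(samples) // hop
    return [samples[i * hop:(i + 1) * hop] for i in range(n)]

sr = 22050
song = np.zeros(35 * sr)  # 35 s of silence as a stand-in waveform
snips = ten_second_snippets(song, sr)
print(len(snips))  # 3 full 10-second snippets; the trailing 5 s are dropped
```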
These models run best on NVIDIA GPUs under Python 3.10 with an older version of torch.
Mac:

```shell
brew install sox
```
Linux, for the JMLA model:

```shell
apt update
apt install python3.9
pip install virtualenv
virtualenv jaml -p python3.9
source jaml/bin/activate
pip install -r requirements-jmla.txt
pip install -U openmim
mim install mmcv==1.7.1
apt-get install python3-tk
```
Linux, for the speechbrain model:

```shell
sudo apt-get update
sudo apt-get install sox
pip install -r requirements-speechbrain.txt
```
Conda:

```shell
conda install conda-forge::sox
```