This project was carried out by YAI 11th, in cooperation with POZAlabs.
🔎 For details, please refer to the Project Full Report.
🛠️ KIM DONGHA : YAI 8th / AI Dev Lead
🚀 KIM TAEHEON : YAI 10th / AI Research & Dev
👑 LEE SANGMIN : YAI 9th / Team Leader
🐋 LEE SEUNGJAE : YAI 9th / AI Research Lead
🌈 CHOI JEONGWOO : YAI 10th / AI Research & Dev
🌟 CHOI WOOHYEON : YAI 10th / AI Research & Dev
(Demo video: midi2.mp4)
git clone https://github.com/YAIxPOZAlabs/MuseDiffusion.git
cd MuseDiffusion
python3 -m pip install virtualenv && \
python3 -m virtualenv venv --python=python3.8 && \
source venv/bin/activate && \
pip3 install -r requirements.txt
(Optional) If required, install python 3.8 for venv usage.
sudo apt update && \
sudo apt install -y software-properties-common && \
sudo add-apt-repository -y ppa:deadsnakes/ppa && \
sudo apt install -y python3.8 python3.8-distutils
(Optional) If Anaconda is available, you can set up the environment with Anaconda instead of the commands above.
conda create -n MuseDiffusion python=3.8 pip wheel
conda activate MuseDiffusion
pip3 install -r requirements.txt
(Recommended) If Docker is available, use the Dockerfile instead.
docker build -f Dockerfile -t musediffusion:v1 .
python3 -m MuseDiffusion dataprep
- If you want to use a custom ComMU-like dataset, convert it to npy files (refer to this issue) and preprocess it with the following command.
python3 -m MuseDiffusion dataprep --data_dir path/to/dataset
After this step, your directory structure will look like this:
MuseDiffusion
├── MuseDiffusion
│ ├── __init__.py
│ ├── config
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ └── base.py
│ ├── data
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── corruption.py
│ │ └── ...
│ ├── models
│ │ ├── __init__.py
│ │ ├── denoising_model.py
│ │ ├── gaussian_diffusion.py
│ │ ├── nn.py
│ │ └── ...
│ ├── run
│ │ ├── __init__.py
│ │ ├── sample_generation.py
│ │ ├── sample_seq2seq.py
│ │ └── train.py
│ └── utils
│ ├── __init__.py
│ ├── decode_util.py
│ ├── dist_util.py
│ ├── train_util.py
│ └── ...
├── assets
│ └── (files for readme...)
├── commu
│ └── (same code as https://github.com/POZAlabs/ComMU-code/blob/master/commu/)
├── datasets
│ └── ComMU-processed
│ └── (preprocessed commu dataset files...)
├── scripts
│ ├── run_train.sh
│ ├── sample_seq2seq.sh
│ └── sample_generation.sh
├── README.md
└── requirements.txt
mkdir diffusion_models
mkdir diffusion_models/pretrained_weights
cd diffusion_models/pretrained_weights
wget https://github.com/YAIxPOZAlabs/MuseDiffusion/releases/download/1.0.0/pretrained_weights.zip
unzip pretrained_weights.zip && rm pretrained_weights.zip
cd ../..
python3 -m MuseDiffusion train --distributed
How to customize arguments
- With `--config_json train_cfg.json`, the required arguments above will be loaded automatically:
# Copy config file to root directory
python3 -c "from MuseDiffusion.config import TrainSettings as T; print(T().json(indent=2))" \
>> train_cfg.json
# Customize config on your own
vi train_cfg.json
# Run training script
python3 -m MuseDiffusion train --distributed --config_json train_cfg.json
- To add your own arguments, refer to `python3 -m MuseDiffusion train --help`.
Refer to example below:
python3 -m MuseDiffusion train --distributed \
--lr 0.0001 \
--batch_size 2048 \
--microbatch 64 \
--learning_steps 320000 \
--log_interval 20 \
--save_interval 1000 \
--eval_interval 500 \
--ema_rate 0.5,0.9,0.99 \
--seed 102 \
--diffusion_steps 2000 \
--schedule_sampler lossaware \
--noise_schedule sqrt \
--seq_len 2096 \
--pretrained_denoiser diffuseq.pt \
--pretrained_embedding pozalabs_embedding.pt \
--freeze_embedding false \
--use_bucketing true \
--dataset ComMU \
--data_dir datasets/ComMU-processed \
--data_loader_workers 4 \
--use_corruption true \
--corr_available mt,mn,rn,rr \
--corr_max 4 \
--corr_p 0.5 \
--corr_kwargs "{'p':0.4}" \
--hidden_t_dim 128 \
--hidden_dim 128 \
--dropout 0.4 \
--weight_decay 0.1 \
--gradient_clipping -1.0
With regard to the `--distributed` argument (torch.distributed runner):
- The `--distributed` argument runs `python -m MuseDiffusion train` with the torch.distributed runner; you can customize the options and environment variables below.
- Command-line option `--nproc_per_node`: the number of training nodes (GPUs) to use. Default: the number of GPUs in the `CUDA_VISIBLE_DEVICES` environment variable.
- Command-line option `--master_port`: the master port for distributed learning. Default: found automatically if available, otherwise `12233`.
- Environment variable `CUDA_VISIBLE_DEVICES`: specific GPU indices, e.g. `CUDA_VISIBLE_DEVICES=4,5,6,7`. Default: not set, in which case the trainer uses all available GPUs.
- Environment variable `OMP_NUM_THREADS`: the number of threads for each node. Default: automatically set to `$CPU_CORE / $TOTAL_GPU`.
- On Windows, torch.distributed is disabled by default. To enable it, edit the `USE_DIST_IN_WINDOWS` flag in `MuseDiffusion/utils/dist_util.py`.
Refer to example below:
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m MuseDiffusion train --distributed --master_port 12233
After training, weights and configs will be saved into `./diffusion_models/{name-of-model-folder}/`.
python3 -m MuseDiffusion modification --distributed \
--use_corruption True \
--corr_available rn,rr \
--corr_max 2 \
--corr_p 0.5 \
--step 500 \
--strength 0.75 \
--model_path ./diffusion_models/{name-of-model-folder}/{weight-file}
- You can use the same `torch.distributed` arguments as in the training script.
- Type `python3 -m MuseDiffusion modification --help` for detailed usage.
- You can omit the `--model_path` argument if you want to use the pretrained weights.
python3 -m MuseDiffusion generation --distributed \
--bpm {BPM} \
--audio_key {AUDIO_KEY} \
--time_signature {TIME_SIGNATURE} \
--pitch_range {PITCH_RANGE} \
--num_measures {NUM_MEASURES} \
--inst {INST} \
--genre {GENRE} \
--min_velocity {MIN_VELOCITY} \
--max_velocity {MAX_VELOCITY} \
--track_role {TRACK_ROLE} \
--rhythm {RHYTHM} \
--chord_progression {CHORD_PROGRESSION} \
--num_samples 1000 \
--step 500 \
--model_path diffusion_models/{name-of-model-folder}/{weight-file}
- In generation, the MidiMeta arguments (bpm, audio_key, ..., chord_progression) are required.
- You can use the same `torch.distributed` arguments as in the training script.
- Type `python3 -m MuseDiffusion generation --help` for detailed usage.
- You can omit the `--model_path` argument if you want to use the pretrained weights.
Using a MidiMeta JSON file instead of arguments (recommended over command-line input)
python3 scripts/meta_json_generator.py # This generates meta.json in git root.
# Replace {META_JSON_FILE_PATH} below with meta.json
python3 -m MuseDiffusion generation --distributed \
--meta_json {META_JSON_FILE_PATH} \
--num_samples 1000 \
--step 500 \
--model_path diffusion_models/{name-of-model-folder}/{weight-file}
Example Commandline
Refer to example below:
python3 -m MuseDiffusion generation --distributed \
--num_samples 1000 \
--bpm 70 --audio_key aminor --time_signature 4/4 --pitch_range mid_high \
--num_measures 8 --inst acoustic_piano --genre newage \
--min_velocity 60 --max_velocity 80 --track_role main_melody --rhythm standard \
--chord_progression Am-Am-Am-Am-Am-Am-Am-Am-G-G-G-G-G-G-G-G-F-F-F-F-F-F-F-F-E-E-E-E-E-E-E-E-Am-Am-Am-Am-Am-Am-Am-Am-G-G-G-G-G-G-G-G-F-F-F-F-F-F-F-F-E-E-E-E-E-E-E-E
MuseDiffusion: a diffusion model that modifies, and also generates, MIDI data corresponding to the given meta information.
We chose DiffuSeq as the baseline and use the ComMU dataset, in which meta and midi are tokenized and paired. The discrete meta and midi data are projected into a continuous domain using an embedding function. We train the diffusion model and the embedding weights jointly, so that MuseDiffusion learns the relation between meta and midi.
Forward Process
An embedding function, EMB, maps the discrete meta and midi tokens into a continuous latent space; the forward process then gradually adds Gaussian noise to the embedded sequence over the diffusion steps.
Reverse Process
The reverse process recovers the original data by iteratively denoising the noised latent z_t with the trained model.
Our objective function is:
$$L(w) = \mathbb{E}_{q_\phi}\left[\sum_{t=2}^{T}\|y_0-\tilde f_\theta(z_t,t)\|^2 + \|\mathrm{EMB}(w^y)-\tilde f_{\theta}(z_1,1)\|^2 + \mathcal{R}(\|z_0\|^2)\right]$$
Embedding Space
(Figures: our embedding space vs. ComMU's embedding space.)
To generate midi using only meta data, we randomly sample a Gaussian-noise note sequence and denoise it step by step, conditioned on the given meta tokens.
To modify a corrupted midi into a correct midi corresponding to the given meta, we add noise to the corrupted midi up to 0.75 × the DDIM steps and then denoise it.
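For intuition, the modification step can be sketched as follows; this is only a schematic, and `modify`, `denoise_step`, and `alphas_cumprod` are placeholder names rather than the project's actual API.

```python
import torch

def modify(corrupted_z0, alphas_cumprod, denoise_step, ddim_steps=500, strength=0.75):
    """Schematic sketch: noise the corrupted embedding up to strength * ddim_steps,
    then run the reverse process back to step 0."""
    t_start = int(strength * ddim_steps)                             # e.g. 0.75 * 500 = 375
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(corrupted_z0)
    z_t = a_bar.sqrt() * corrupted_z0 + (1 - a_bar).sqrt() * noise   # forward q(z_t | z_0)
    for t in reversed(range(t_start)):                               # partial reverse process
        z_t = denoise_step(z_t, t)                                   # placeholder for one DDIM update
    return z_t
```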
We design a metric for this task. Since our model is based on diffusion, which reconstructs the ground-truth MIDI, we need a metric that compares paired MIDI. In addition, we should check whether generated samples generally follow the distribution of the ground truth.
To solve the first problem, we define a new metric named MSIM (Musical Similarity Index Measure). Similar to SSIM, it separates the similarity measurement into three comparisons: rhythm, melody, and harmony.
Rhythm similarity
We use groove similarity as the baseline for comparing rhythm. Since our comparison is based on MIDI, we can obtain each note's actual velocity and maximum amplitude directly from it. So, instead of calculating the RMS of the amplitude over 32 separate sections, we calculate the actual amplitude at 32 points.
Music tends to repeat the same rhythm in each bar, so we calculate a groove vector for each bar and sum them to obtain the rhythm vector.
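Since the exact formulas are not reproduced here, the following is only an illustrative sketch under assumed conventions: a note is given as (onset time in beats, velocity), a bar has 4 beats, and amplitude is approximated by velocity.

```python
import numpy as np

def rhythm_vector(notes, num_bars, beats_per_bar=4, points=32):
    """Illustrative sketch: build a 32-point groove vector per bar from note
    velocities, then sum the per-bar groove vectors into a rhythm vector."""
    groove = np.zeros((num_bars, points))
    for onset_beats, velocity in notes:                 # assumed note format
        bar = int(onset_beats // beats_per_bar)
        if bar >= num_bars:
            continue
        idx = int((onset_beats % beats_per_bar) / beats_per_bar * points)
        groove[bar, idx] = max(groove[bar, idx], velocity)
    return groove.sum(axis=0)

def rhythm_similarity(r1, r2):
    # Cosine similarity between two rhythm vectors.
    return float(np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2) + 1e-12))
```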
Melody similarity
To evaluate melody, we evaluate the progression of pitch. Since moving by the same number of semitones is considered similar in music, we define a progression vector over these semitone moves.
Each similarity is guaranteed to be positive, since all of the vectors have only non-negative entries.
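Because the exact definition is not shown above, the sketch below illustrates only one plausible reading: count the semitone moves between consecutive pitches, so transposed melodies map to similar vectors, and compare the count vectors with cosine similarity.

```python
import numpy as np

def progression_vector(pitches, max_interval=24):
    """Hypothetical sketch: histogram of semitone moves between consecutive
    notes, clipped to +/- max_interval semitones."""
    vec = np.zeros(2 * max_interval + 1)
    for a, b in zip(pitches[:-1], pitches[1:]):
        step = int(np.clip(b - a, -max_interval, max_interval))
        vec[step + max_interval] += 1
    return vec

def melody_similarity(pitches1, pitches2):
    u, v = progression_vector(pitches1), progression_vector(pitches2)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```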
Harmony similarity
To evaluate harmony, we use chroma similarity as the baseline, which simply counts the appearances of each pitch class. The chroma vector, harmony vector, and harmony similarity are defined from these counts.
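A minimal sketch of this chroma-style counting, assuming MIDI pitch numbers as input (the project's exact normalization may differ):

```python
import numpy as np

def chroma_vector(pitches):
    """Count occurrences of each of the 12 pitch classes."""
    vec = np.zeros(12)
    for p in pitches:
        vec[p % 12] += 1
    return vec

def harmony_similarity(pitches1, pitches2):
    u, v = chroma_vector(pitches1), chroma_vector(pitches2)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```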
To solve the second problem, we use a simple 1NNC (1-nearest-neighbor classifier) test based on MSIM. There are several IQA-style methods that use a pretrained model, but MIDI has no widely used SOTA classifier, so we decided to compare only the distributions of the music. 1NNC applies a KNN classifier that assigns each sample the label of its nearest neighbor, excluding itself, and measures the resulting accuracy; an accuracy close to 0.5 means the generated distribution is hard to distinguish from the ground truth.
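A minimal sketch of the 1NNC computation, assuming a (2N, 2N) pairwise distance matrix derived from MSIM (e.g. 1 − MSIM) over N ground-truth and N generated samples:

```python
import numpy as np

def one_nn_accuracy(dist):
    """Classify each sample by the label of its nearest neighbor (excluding itself)
    and return the accuracy; values near 0.5 indicate well-matched distributions."""
    n = dist.shape[0] // 2
    labels = np.array([0] * n + [1] * n)        # 0 = ground truth, 1 = generated
    d = dist.copy().astype(float)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nearest = d.argmin(axis=1)
    return float((labels[nearest] == labels).mean())
```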
ComMU is a dataset for conditional music generation that consists of 12 types of metadata and MIDI samples manually composed by professional composers according to the corresponding meta. In particular, ComMU extends the REMI representation, so tokenized note sequences are expressed as a 1-D array together with the tokenized metadata.
ComMU’s 12 metadata are BPM, genre, key, instrument, track-role, time signature, pitch range, number of measures, chord progression, min velocity, max velocity, and rhythm.
Notable properties of ComMU are that (1) the dataset is manually constructed by professional composers with an objective guideline that induces regularity, and (2) it has 12 musical metadata that embrace the composers' intentions.
Token Map of Midi data and Metadata in ComMU Dataset
Dataset Structure
In order to further reduce training time, we apply data bucketing when loading the data. Data bucketing forms batches by grouping sequences of similar length together when processing data in batch units. By doing so, we minimize zero-padding, which reduces training time and helps the denoising step.
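A minimal sketch of the idea (not the project's actual data loader): sort by length and slice consecutive chunks, so sequences in a batch have similar lengths.

```python
def bucket_batches(sequences, batch_size):
    """Group indices of similar-length sequences into batches to minimize padding."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```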
We preprocess the data in two ways. First, we move all chord-related tokens inside the midi sequence into the meta sequence, so that the chord information helps create correct midi data. Then, we randomly corrupt the data, so the model receives wrong data and learns to produce the correct note sequence. We apply four types of data corruption. First, "Masking Token" (mt) randomly replaces each token with a masking token. Second, "Masking Note" (mn) randomly replaces each note with a masking token. Third, "Randomize Note" (rn) changes the velocity, pitch, and duration values of each note to random values. Finally, "Random Rotating" (rr) swaps the positions of two randomly selected bars.
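As an illustration, the "Masking Token" (mt) corruption could look roughly like the sketch below; the mask token id and token layout are placeholders, not the project's actual vocabulary.

```python
import random

MASK_TOKEN = 1  # placeholder id, not the actual vocabulary

def masking_token(tokens, p=0.4):
    """Sketch of 'mt': independently replace each token with the mask token
    with probability p (cf. corr_kwargs "{'p':0.4}")."""
    return [MASK_TOKEN if random.random() < p else tok for tok in tokens]
```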
Corruption arguments for training/sampling:
- corr_available: list of corruptions to select (string, comma-separated)
- corr_max: maximum number of corruptions to select (int)
- corr_p: probability of selecting each corruption (float)
- corr_kwargs: keyword arguments passed to each corruption function ('eval'-able string)
{
"corr_available": "mt,mn,rn,rr",
"corr_max": 4,
"corr_p": 0.5,
"corr_kwargs": "{'p':0.4}"
}
Our model is based on 12 layers of Transformer encoder, with embedding dimension 500 and 2,000 diffusion steps. We initialize the Transformer with DiffuSeq's trained weights, and the embedding function with embedding weights trained on ComMU for the same task using an auto-regressive model.
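As a rough sketch of such a backbone (in PyTorch; the head count is an assumption, and the real model additionally conditions on the diffusion time step and the meta tokens):

```python
import torch.nn as nn

# Assumed configuration: 12 encoder layers, model width 500, 10 attention heads.
encoder_layer = nn.TransformerEncoderLayer(d_model=500, nhead=10, batch_first=True)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=12)
```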
At higher t, where the data is almost Gaussian noise, it is harder for the model to denoise than at smaller t. When we divided the denoising time steps into four parts, Q0 to Q3, we observed that losses were relatively higher at Q3, where the noise is highest. Therefore, we train the model by sampling as many time steps t as the batch size, using importance sampling based on the 10 most recent losses.
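Conceptually, such a loss-aware sampler can be sketched as follows (an assumption-level illustration, not the project's exact `lossaware` implementation): keep the last 10 losses per time step and draw t with probability proportional to the RMS of those recent losses.

```python
import numpy as np

class LossAwareSamplerSketch:
    """Illustrative importance sampler over diffusion time steps."""

    def __init__(self, diffusion_steps, history=10):
        self.losses = np.zeros((diffusion_steps, history))
        self.counts = np.zeros(diffusion_steps, dtype=int)

    def update(self, ts, losses):
        # Record the most recent losses for each sampled time step.
        for t, loss in zip(ts, losses):
            self.losses[t, self.counts[t] % self.losses.shape[1]] = loss
            self.counts[t] += 1

    def sample(self, batch_size):
        if (self.counts < self.losses.shape[1]).any():
            weights = np.ones(len(self.losses))       # warm-up: uniform sampling
        else:
            weights = np.sqrt((self.losses ** 2).mean(axis=1))
        p = weights / weights.sum()
        return np.random.choice(len(p), size=batch_size, p=p)
```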
In our experiments, we move the chord information from the note sequence into the metadata, because the target sequence is much longer than the source sequence and the chord should not be changed.
- You can download the pretrained weights from the repository's Releases page.
@inproceedings{hyun2022commu,
title={Com{MU}: Dataset for Combinatorial Music Generation},
author={Lee Hyun and Taehyun Kim and Hyolim Kang and Minjoo Ki and Hyeonchan Hwang and Kwanho Park and Sharang Han and Seon Joo Kim},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2022},
}
@inproceedings{gong2022diffuseq,
author = {Gong, Shansan and Li, Mukai and Feng, Jiangtao and Wu, Zhiyong and Kong, Lingpeng},
booktitle = {International Conference on Learning Representations, ICLR},
title = {{DiffuSeq}: Sequence to Sequence Text Generation with Diffusion Models},
year = 2023
}
@inproceedings{wolf-etal-2020-transformers,
title = "Transformers: State-of-the-Art Natural Language Processing",
author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = oct,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
pages = "38--45"
}