Implementation of MelNet: A Generative Model for Audio in the Frequency Domain
- Tested with Python 3.6.8 & 3.7.4, PyTorch 1.2.0 & 1.3.0.
pip install -r requirements.txt
- Blizzard, VoxCeleb2, and KSS have YAML files provided under
config/
. For other datasets, fill out your own YAML file according to the other provided ones. - Unconditional training is possible for all kinds of datasets, provided that they have a consistent file extension specified by
data.extension
within the YAML file. - Conditional training is currently only implemented for KSS and a subset of the Blizzard dataset.
python trainer.py -c [config YAML file path] -n [name of run] -t [tier number] -b [batch size] -s [TTS]
- Each tier can be trained separately. Since each tier is larger than the one before it (with the exception of tier 1), modify the batch size for each tier.
- Tier 6 of the Blizzard dataset does not fit on a 16GB P100, even with a batch size of 1.
- The
-s
flag is a boolean for determining whether to train a TTS tier. Since a TTS tier only differs at tier 1, this flag is ignored when[tier number] != 0
. Warning: this flag is toggledTrue
no matter what follows the flag. Ignore it if you're not planning to use it.
- Each tier can be trained separately. Since each tier is larger than the one before it (with the exception of tier 1), modify the batch size for each tier.
- The checkpoints must be stored under
chkpt/
. - A YAML file named
inference.yaml
must be provided underconfig/
. inference.yaml
must specify the number of tiers, the names of the checkpoints, and whether or not it is a conditional generation.
python inference.py -c [config YAML file path] -p [inference YAML file path] -t [timestep of generated mel spectrogram] -n [name of sample] -i [input sentence for conditional generation]
- Timestep refers to the length of the mel spectrogram. The ratio of timestep to seconds is roughly
[sample rate] : [hop length of FFT]
. - The
-i
flag is optional, only needed for conditional generation. Surround the sentence with""
and end with.
. - Both unconditional generation and conditional generation currently does not support primed generation (extrapolating from provided data).
- Timestep refers to the length of the mel spectrogram. The ratio of timestep to seconds is roughly
- Implement upsampling procedure
- GMM sampling + loss function
- Unconditional audio generation
- TTS synthesis
- Tensorboard logging
- Multi-GPU training
- Primed generation
- Seungwon Park, June Young Yi, Yoonhyung Lee, Joowhan Song @ Deepest Season 6
MIT License