An unofficial implementation of Vector Quantization Voice Conversion (VQVC; D. Y. Wu et al., 2020).
## Install dependencies

- python=3.7
- pytorch=1.7

```bash
pip install -r requirements.txt
```
## Download dataset and pretrained VocGAN model

- Please download the VCTK dataset and edit `dataset_path` in `config.py` (see the sketch below).
- Download the pretrained VocGAN model.
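For reference, a minimal sketch of how the `dataset_path` entry in `config.py` might look; the value below is just a placeholder for wherever you unpacked VCTK, not the repo's actual default:

```python
# config.py (sketch): only `dataset_path` is referenced in this README;
# the value is a placeholder for your local VCTK location.
dataset_path = "/path/to/VCTK-Corpus"
```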
## Preprocess

Preprocess the mel-spectrograms via the following command:

```bash
python prepro.py 1 1
```

- First argument: mel-preprocessing.
- Second argument: metadata split. (You may change the proportion of samples used for train/eval via `data_split_ratio` in `config.py`.)
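This repo follows NVIDIA's preprocessing method (see Notes below), but conceptually the mel-preprocessing step boils down to something like this librosa sketch; the filename and the `n_fft`/`hop_length`/`n_mels` values here are assumptions, not the repo's actual settings:

```python
import librosa
import numpy as np

# Conceptual sketch of mel-spectrogram extraction (librosa version);
# all parameter values are assumed, not the repo's settings.
y, sr = librosa.load("p225_001.wav", sr=22050)          # one VCTK utterance
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log-compress
np.save("p225_001.npy", log_mel)                        # cache for training
```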
## Train the model

```bash
python train.py
```

- In `config.py`, you may edit `train_visible_device` to choose the GPU used for training (see the sketch after this list).
- As in the paper, 60K steps are enough.
- Training takes only about 30 minutes.
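As referenced above, one plausible way a `train_visible_device` value is applied; this is an assumption about the mechanism, not a guaranteed reading of the repo's code:

```python
import os

# Assumed usage: restrict training to the GPU chosen in config.py.
train_visible_device = "0"                        # placeholder value
os.environ["CUDA_VISIBLE_DEVICES"] = train_visible_device
```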
## Voice conversion

- After training, point to the source and reference speech for voice conversion. (You may edit `src_paths` and `ref_paths` in `conversion.py`.)
- As a result of the conversion, you can find samples in the `results` directory.

```bash
python conversion.py
```
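For intuition, a minimal PyTorch sketch of the conversion rule from the paper: the quantized code carries the content, and the time-averaged residual (encoder output minus code) carries the style. `enc`, `dec`, and `codebook` are placeholders, not this repo's actual API:

```python
import torch

def quantize(z, codebook):
    # Nearest-codeword lookup: z is (T, D) encoder frames, codebook is (K, D).
    idx = torch.cdist(z, codebook).argmin(dim=1)
    return codebook[idx]                               # (T, D) content codes

def convert(enc, dec, codebook, src_mel, ref_mel):
    # VQVC conversion rule (sketch): content of the source plus the
    # time-averaged style residual of the reference.
    z_src, z_ref = enc(src_mel), enc(ref_mel)
    content = quantize(z_src, codebook)
    style = (z_ref - quantize(z_ref, codebook)).mean(dim=0, keepdim=True)
    return dec(content + style)                        # converted mel
```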
## Train loss visualization

The following quantities are logged during training (a sketch of how they can be computed follows this list):

- Reconstruction loss
- Commitment loss
- Perplexity of the codebook
- Total loss
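A sketch of how these four quantities can be computed for a VQ layer; the exact loss weighting, distance choices, and normalization in this repo are assumptions:

```python
import torch
import torch.nn.functional as F

def vq_metrics(mel, recon, z_e, z_q, code_idx, num_codes, beta=0.25):
    # Illustrative definitions (weights and distances assumed);
    # code_idx is a long tensor of selected codeword indices per frame.
    recon_loss = F.l1_loss(recon, mel)                  # reconstruction loss
    commit_loss = F.mse_loss(z_e, z_q.detach())         # commitment loss
    avg_probs = F.one_hot(code_idx, num_codes).float().mean(dim=0)
    perplexity = torch.exp(                             # codebook usage
        -(avg_probs * (avg_probs + 1e-10).log()).sum()
    )
    total_loss = recon_loss + beta * commit_loss        # total loss
    return recon_loss, commit_loss, perplexity, total_loss
```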
## Mel-spectrogram visualization

- Ground-truth (top), reconstructed mel (top-middle), contents mel (bottom-middle), and style mel (bottom, i.e., the encoder output minus the code).
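If you want to reproduce this kind of stacked-panel figure yourself, a matplotlib sketch (random arrays stand in for the actual mels; this is not the repo's plotting code):

```python
import matplotlib.pyplot as plt
import numpy as np

# Sketch: stack mel panels vertically, as in the figures described above.
mels = [np.random.rand(80, 200) for _ in range(4)]     # placeholder data
titles = ["ground truth", "reconstructed", "contents", "style"]
fig, axes = plt.subplots(len(mels), 1, figsize=(8, 8))
for ax, mel, title in zip(axes, mels, titles):
    ax.imshow(mel, origin="lower", aspect="auto")      # freq up, time right
    ax.set_title(title)
fig.tight_layout()
plt.show()
```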
## Audio samples

You may listen to the audio samples.
## Visualization of converted mel-spectrogram

- Source mel (top), reference mel (middle), converted mel (bottom).
## Pretrained models

- Download the pretrained VQVC model and place it in `ckpts/VCTK-Corpus/`.
- Download the pretrained VocGAN model and place it in `vocoder/vocgan/pretrained_models/`.
## Notes

- Trimming silence and the stride of the convolutions are very important for transferring the style from the reference speech (see the trimming sketch after this list).
- Unlike the paper, I used NVIDIA's preprocessing method so that the pretrained VocGAN model can be used.
- Training is very unstable. (After 70K steps, the perplexity of the codebook drops sharply to 1.)
- (Future work) The model trained on the Korean Emotional Speech dataset is not completed yet.
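As referenced in the first note, silence trimming can be done with `librosa.effects.trim`; the filename and `top_db` threshold here are assumed values, not the repo's actual parameters:

```python
import librosa

# Trim leading/trailing silence before computing mels (top_db assumed).
y, sr = librosa.load("sample.wav", sr=22050)
y_trimmed, _ = librosa.effects.trim(y, top_db=20)
```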
## References

- One-shot Voice Conversion by Vector Quantization (D. Y. Wu et al., 2020)
- VocGAN implementation by rishikksh20
- NVIDIA's preprocessing method