
VQVC-Pytorch

An unofficial PyTorch implementation of Vector Quantization Voice Conversion (VQVC; D. Y. Wu et al., 2020)

(figure: model architecture)

How-to-run

  1. Install dependencies.

    pip install -r requirements.txt
    
  2. Download the dataset and the pretrained VocGAN vocoder model (see Pretrained models below).

  3. Preprocess

    • Preprocess mel-spectrograms via the following command:
    python prepro.py 1 1
    
    • first argument: flag for mel-spectrogram preprocessing
    • second argument: flag for metadata splitting (you may change the portion of samples used for train/eval via data_split_ratio in config.py; see the sketch below)
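
    A hypothetical sketch of the corresponding entry in config.py (the name data_split_ratio comes from this README; the value shown is an assumption):

      # config.py (sketch) -- fraction of samples assigned to the training set;
      # the remainder is used for evaluation (hypothetical value)
      data_split_ratio = 0.9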
  4. Train the model

    python train.py
    
    • In config.py, you may edit train_visible_device to choose the GPU used for training (see the sketch below).
    • As in the paper, 60K steps are enough.
    • Training takes only about 30 minutes.
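
    A hypothetical sketch of the GPU-selection entry in config.py (the name train_visible_device comes from this README; the value is an assumption):

      # config.py (sketch) -- id of the GPU made visible for training
      # (hypothetical value; e.g., used to set CUDA_VISIBLE_DEVICES)
      train_visible_device = '0'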
  5. Voice conversion

    • After training, specify the source and reference speech files for voice conversion. (You may edit src_paths and ref_paths in conversion.py; see the sketch below.)
    • The converted samples are written to the results directory.
    python conversion.py
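
    A hypothetical sketch of the path lists in conversion.py (the names src_paths and ref_paths come from this README; the .wav paths are placeholders):

      # conversion.py (sketch) -- the contents of src_paths[i] are converted
      # into the voice of ref_paths[i] (placeholder paths)
      src_paths = ['samples/source_utterance.wav']
      ref_paths = ['samples/reference_speaker.wav']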
    

Visualization of training

Train loss visualization

  • reconstruction loss (figure: recon_loss)

  • commitment loss (figure: commitment_loss)

  • perplexity of codebook (figure: perplexity)

  • total loss (figure: total_loss)
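
For orientation, a minimal PyTorch sketch of how these four quantities are typically computed in a VQ-VAE-style model (this follows the standard formulation, not necessarily the repository's exact code; the commitment weight 0.25, all names, and the tensor shapes are assumptions):

    import torch
    import torch.nn.functional as F

    def vq_losses(encoder_out, codebook, mel_target, decode):
        # encoder_out: (B, T, D) continuous encoder output
        # codebook:    (K, D) learnable code vectors
        # mel_target:  (B, T, n_mels) ground-truth mel-spectrogram
        dist = torch.cdist(encoder_out, codebook.unsqueeze(0))  # (B, T, K)
        indices = dist.argmin(dim=-1)                           # nearest code per frame
        quantized = codebook[indices]                           # (B, T, D)

        # straight-through estimator: gradients bypass the argmin
        quantized_st = encoder_out + (quantized - encoder_out).detach()

        recon_loss = F.l1_loss(decode(quantized_st), mel_target)   # reconstruction loss
        commit_loss = F.mse_loss(encoder_out, quantized.detach())  # commitment loss
        total_loss = recon_loss + 0.25 * commit_loss               # assumed weight

        # perplexity: effective number of codes in use; it collapses toward 1
        # when the codebook degenerates (see Experimental Notes below)
        usage = F.one_hot(indices, codebook.size(0)).float()
        usage = usage.reshape(-1, codebook.size(0)).mean(dim=0)
        perplexity = torch.exp(-(usage * (usage + 1e-10).log()).sum())
        return recon_loss, commit_loss, total_loss, perplexity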

Mel-spectrogram visualization

  • Ground truth (top), reconstructed mel (top-middle), contents mel (bottom-middle), and style mel (bottom; i.e., the encoder output with the quantized code subtracted)

(figure: mel-spectrogram visualization)
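
A minimal sketch of the decomposition visualized above, with illustrative names (not the repository's exact code): the quantized code is treated as the linguistic contents, and the residual of the encoder output as the speaker style.

    contents = quantize(enc_out)   # nearest codebook vectors -> "contents mel"
    style = enc_out - contents     # residual of the encoder output -> "style mel"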

Inference results

  • You may listen to the audio samples.

  • Visualization of converted mel-spectrogram

    • source mel (top), reference mel (middle), converted mel (bottom)

(figure: converted mel-spectrogram)
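
Following the paper's formulation, conversion recombines the source contents with the reference style before decoding. A sketch with illustrative names (the time-averaging of the style residual follows the VQVC paper; shapes assume time at dim 1):

    src_contents = quantize(enc(src_mel))                 # phonetic contents of the source
    ref_residual = enc(ref_mel) - quantize(enc(ref_mel))  # style residual of the reference
    ref_style = ref_residual.mean(dim=1, keepdim=True)    # average over time -> speaker embedding
    converted_mel = decode(src_contents + ref_style)      # source contents in the reference voice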

Pretrained models

  1. VQVC pretrained model
    • Download the pretrained VQVC model and place it in ckpts/VCTK-Corpus/.
  2. VocGAN pretrained model
    • Download the pretrained VocGAN model and place it in vocoder/vocgan/pretrained_models/.

Experimental Notes

  • Trimming silence and the convolution stride are very important for transferring the style from the reference speech.
  • Unlike the paper, I used NVIDIA's preprocessing method so that the pretrained VocGAN model could be used.
  • Training is very unstable: after 70K steps, the perplexity of the codebook collapses to 1.
  • (Future work) Training on the Korean Emotional Speech dataset is not complete yet.

References (or acknowledgements)

  • D.-Y. Wu and H.-y. Lee, "One-Shot Voice Conversion by Vector Quantization," ICASSP 2020.
  • J. Yang et al., "VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network," Interspeech 2020.
