smartist1401/so-vits-svc-5.0

Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS

  • 💗This project is aimed at deep learning beginners; basic familiarity with Python and PyTorch is a prerequisite for using it.
  • 💗This project aims to help deep learning beginners move beyond dry, purely theoretical study and master the basics of deep learning through practice.
  • 💗This project does not support real-time voice conversion (supporting it would require replacing whisper).
  • 💗This project will not develop one-click packages for other purposes.

[Figure: sovits_framework]

  • can be trained on a GPU with 6 GB of memory

  • supports multiple speakers

  • creates unique speakers through speaker mixing

  • can convert audio even with light accompaniment

  • F0 can be edited in Excel

Model properties

https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/hifigan_release

  • sovits5.0_main_1500.pth contains the generator and the discriminator (176M in total) and can be used as a pre-trained model
  • speaker files are in the configs/singers directory and can be used for inference tests, especially for checking timbre leakage
  • speakers 22, 30, 47, and 51 are highly recognizable; their training audio samples are in the configs/singers_sample directory

| Feature | From | Status | Function | Remarks |
| --- | --- | --- | --- | --- |
| whisper | OpenAI | ✅ | strong noise immunity | - |
| bigvgan | NVIDIA | ✅ | anti-aliasing and snake activation | Uses slightly more GPU memory and has been removed from the main branch; switch to the bigvgan branch to use it. Formants are clearer and sound quality is noticeably improved. |
| natural speech | Microsoft | ✅ | reduce mispronunciation | - |
| neural source-filter | NII | ✅ | solve the problem of audio F0 discontinuity | - |
| speaker encoder | Google | ✅ | timbre encoding and clustering | - |
| GRL for speaker | Ubisoft | ✅ | prevent the encoder from leaking timbre | - |
| one shot vits | Samsung | ✅ | voice clone | - |
| SCLN | Microsoft | ✅ | improve clone | - |
| PPG perturbation | this project | ✅ | improve noise immunity and timbre removal | - |
| VAE perturbation | this project | ✅ | improve sound quality | - |

💗Because of the data perturbation, training takes longer than in other projects.

Dataset preparation

Necessary pre-processing:

  • 1 accompaniment separation
  • 2 band extension
  • 3 sound quality improvement
  • 4 cut the audio into clips shorter than 30 seconds for whisper💗

Then put the dataset into the dataset_raw directory using the following file structure:

dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
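
The 30-second limit and the per-speaker layout above can be checked before any preprocessing runs. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it assumes the clips are PCM WAV files that Python's standard `wave` module can open.

```python
import wave
from pathlib import Path

MAX_SECONDS = 30.0  # clip length limit for whisper mentioned above


def wav_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def find_long_clips(root="dataset_raw", max_seconds=MAX_SECONDS):
    """Return (speaker, filename, seconds) for every clip at or over the limit."""
    too_long = []
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # only speaker sub-directories are expected at this level
        for wav_path in sorted(speaker_dir.glob("*.wav")):
            seconds = wav_seconds(wav_path)
            if seconds >= max_seconds:
                too_long.append((speaker_dir.name, wav_path.name, seconds))
    return too_long
```

Calling `find_long_clips()` from the repository root returns an empty list when every clip is short enough; any entries it returns are clips that still need to be cut.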

Install dependencies

  • 1 software dependency

    sudo apt update && sudo apt install ffmpeg

    pip install -r requirements.txt

  • 2 download the timbre encoder Speaker-Encoder by @mueller91 and put best_model.pth.tar into speaker_pretrain/

  • 3 download the whisper multilingual medium model; make sure you download medium.pt, and put it into whisper_pretrain/

  • 4 whisper is bundled with this project; do not install it separately, or it will conflict and raise an error
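
Steps 2 and 3 place two downloaded files at fixed locations, so a quick check can catch a missing or misplaced file before preprocessing starts. This is a minimal sketch with a hypothetical helper name; only the two paths come from the steps above.

```python
from pathlib import Path

# Relative paths taken from install steps 2 and 3 above.
REQUIRED_FILES = (
    "speaker_pretrain/best_model.pth.tar",
    "whisper_pretrain/medium.pt",
)


def missing_pretrained(repo_root="."):
    """Return the required pretrained files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_pretrained()` means both files are where the later steps expect them.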

Data preprocessing

  • 1, set working directory:

    export PYTHONPATH=$PWD

  • 2, re-sampling

    generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k

    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000

    generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k

    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000

  • 3, use the 16k audio to extract pitch; the default f0_ceil=900 should be adjusted to match the highest pitch in your data

    python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch

    or use the following for low-quality audio

    python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch

  • 4, use the 16k audio to extract ppg

    python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

  • 5, use 16k audio to extract timbre code

    python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker

  • 6, extract the average timbre code for inference; when generating the training index, it can also replace each audio file's individual timbre code and serve as the speaker's unified timbre for training

    python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer

  • 7, use 32k audio to extract the linear spectrum

    python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs

  • 8, use 32k audio to generate training index

    python prepare/preprocess_train.py

  • 9, training file debugging

    python prepare/preprocess_zzz.py
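
The idea behind step 6, one averaged timbre code per speaker, can be sketched as follows. This illustrates the averaging only and is not the contents of preprocess_speaker_ave.py: the function name is hypothetical, and it assumes every .spk.npy file for a speaker holds an array of the same shape.

```python
from pathlib import Path

import numpy as np


def average_speaker(speaker_dir, out_path):
    """Average every *.spk.npy in speaker_dir and save the result to out_path."""
    files = sorted(Path(speaker_dir).glob("*.spk.npy"))
    if not files:
        raise FileNotFoundError(f"no *.spk.npy files in {speaker_dir}")
    codes = np.stack([np.load(f) for f in files])  # shape: (n_files, ...)
    mean_code = codes.mean(axis=0)  # element-wise mean over the files
    np.save(out_path, mean_code)
    return mean_code
```

For example, `average_speaker("data_svc/speaker/speaker0", "data_svc/singer/speaker0.spk.npy")` would write one averaged code for speaker0, provided the output directory already exists.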

data_svc/
├── waves-16k
│    ├── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
├── waves-32k
│    ├── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
├── pitch
│    ├── speaker0
│    │      ├── 000001.pit.npy
│    │      └── 000xxx.pit.npy
│    └── speaker1
│           ├── 000001.pit.npy
│           └── 000xxx.pit.npy
├── whisper
│    ├── speaker0
│    │      ├── 000001.ppg.npy
│    │      └── 000xxx.ppg.npy
│    └── speaker1
│           ├── 000001.ppg.npy
│           └── 000xxx.ppg.npy
├── speaker
│    ├── speaker0
│    │      ├── 000001.spk.npy
│    │      └── 000xxx.spk.npy
│    └── speaker1
│           ├── 000001.spk.npy
│           └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
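
Because each 16k clip should end up with one pitch, ppg, and timbre file, the tree above can be verified mechanically after preprocessing. This is a minimal sketch with a hypothetical function name; it covers only the directories and suffixes shown in the tree.

```python
from pathlib import Path

# Feature directory -> file suffix, as shown in the tree above.
FEATURES = {
    "pitch": ".pit.npy",
    "whisper": ".ppg.npy",
    "speaker": ".spk.npy",
}


def missing_features(root="data_svc"):
    """List feature files expected for each 16k wav but not found on disk."""
    root = Path(root)
    missing = []
    for wav in sorted((root / "waves-16k").glob("*/*.wav")):
        speaker, stem = wav.parent.name, wav.stem
        for folder, suffix in FEATURES.items():
            expected = root / folder / speaker / (stem + suffix)
            if not expected.is_file():
                missing.append(expected)
    return missing
```

Any path returned by `missing_features()` points to a preprocessing step that did not produce output for that clip.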

Train

  • 0, if fine-tuning from the pre-trained model, download it first: sovits5.0_main_1500.pth

    set pretrain: "./sovits5.0_main_1500.pth" in configs/base.yaml, and adjust the learning rate appropriately, e.g. 1e-5

  • 1, set working directory

    export PYTHONPATH=$PWD

  • 2, start training

    python svc_trainer.py -c configs/base.yaml -n sovits5.0

  • 3, resume training

    python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth

  • 4, view log

    tensorboard --logdir logs/
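
Resuming in step 3 requires the path of a specific checkpoint. One way to pick it is to take the most recently modified file in the checkpoint directory. This is a minimal sketch with hypothetical helper names; the `*.pth` pattern and the directory are assumptions based on the resume command above.

```python
from pathlib import Path


def latest_checkpoint(chkpt_dir="chkpt/sovits5.0", pattern="*.pth"):
    """Return the most recently modified checkpoint, or None if there is none."""
    candidates = list(Path(chkpt_dir).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def resume_command(checkpoint, config="configs/base.yaml", name="sovits5.0"):
    """Build the resume command from step 3 as an argument list."""
    return ["python", "svc_trainer.py", "-c", config, "-n", name, "-p", str(checkpoint)]
```

The list returned by `resume_command(latest_checkpoint())` can be passed to `subprocess.run` once a checkpoint exists.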

[Figure: sovits5.0_base]

Inference

  • 1, set working directory

    export PYTHONPATH=$PWD

  • 2, export the inference model: text encoder, Flow network, Decoder network

    python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt

  • 3, use whisper to extract the content encoding as a separate step rather than in one-click inference, to reduce GPU memory usage

    python whisper/inference.py -w test.wav -p test.ppg.npy

    this generates test.ppg.npy; if no ppg file is specified in the next step, it is generated automatically

  • 4, extract the F0 parameter to CSV text format, open the csv file in Excel, and manually correct wrong F0 values with reference to Audition or Sonic Visualiser

    python pitch/inference.py -w test.wav -p test.csv

  • 5, specify parameters and infer

    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv

    when --ppg is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically;

    when --pit is specified, the manually tuned F0 parameter is loaded; if it is not specified, F0 is extracted automatically;

    the output file svc_out.wav is generated in the current directory

    | args | --config | --model | --spk | --wave | --ppg | --pit | --shift |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | name | config path | model path | speaker | wave input | wave ppg | wave pitch | pitch shift |
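
Step 4 corrects F0 by hand in a spreadsheet, but a uniform correction such as transposing a whole clip can also be scripted. This is a minimal sketch under stated assumptions: the CSV layout (two columns, time then F0 in Hz, no header, 0 for unvoiced frames) is assumed and should be checked against a file written by pitch/inference.py, and both function names are hypothetical.

```python
import csv


def shift_f0_rows(rows, semitones):
    """Scale voiced F0 values by 2**(semitones/12); keep unvoiced frames at 0."""
    ratio = 2.0 ** (semitones / 12.0)
    shifted = []
    for time_s, f0 in rows:
        f0 = float(f0)
        shifted.append((time_s, f0 * ratio if f0 > 0 else 0.0))
    return shifted


def shift_f0_csv(src, dst, semitones):
    """Read a two-column (time, F0) CSV, shift the F0 column, and write dst."""
    with open(src, newline="") as fh:
        rows = [row for row in csv.reader(fh) if row]  # skip blank lines
    with open(dst, "w", newline="") as fh:
        writer = csv.writer(fh)
        for time_s, f0 in shift_f0_rows(rows, semitones):
            writer.writerow([time_s, f"{f0:.2f}"])
```

A shift of 12 semitones doubles every voiced F0 value, so `shift_f0_csv("test.csv", "test_up.csv", 12)` raises the contour by one octave, and the result can then be passed to --pit.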

Create singer

the name is pure coincidence: average -> ave -> eva; eve (eva) represents conception and reproduction

python svc_eva.py

eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}

the generated singer file is: eva.spk.npy
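
Speaker mixing with a table like eva_conf can be read as a weighted sum of timbre codes. The sketch below shows that reading only. It is an assumption about what svc_eva.py computes, not a copy of it, and the function name is hypothetical.

```python
import numpy as np


def mix_speakers(weights):
    """Return the weighted sum of the timbre codes named in `weights`.

    `weights` maps an .npy path to its mixing weight, like eva_conf above.
    Returns None when `weights` is empty.
    """
    mixed = None
    for path, weight in weights.items():
        code = np.load(path) * weight
        mixed = code if mixed is None else mixed + code
    return mixed
```

With the weights shown above (0, 0, 0.5, 0.5), the result would be the midpoint between the codes of singers 47 and 51; saving it with `np.save` gives a file usable with --spk under the same assumption.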

💗Both Flow and Decoder take a timbre as input, and you can even feed different timbre parameters to the two modules to create more unique timbres.

Data set

| Name | URL |
| --- | --- |
| KiSing | http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/ |
| PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
| opencpop | https://wenet.org.cn/opencpop/download/ |
| Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
| M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
| CSD | https://zenodo.org/record/4785016#.YxqrTbaOMU4 |
| KSS | https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset |
| JVS MuSic | https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music |
| PJS | https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus |
| JSUT Song | https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song |
| MUSDB18 | https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems |
| DSD100 | https://sigsep.github.io/datasets/dsd100.html |
| Aishell-3 | http://www.aishelltech.com/aishell_3 |
| VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |

Code sources and references

https://github.com/facebookresearch/speech-resynthesis

https://github.com/jaywalnut310/vits

https://github.com/openai/whisper/

https://github.com/NVIDIA/BigVGAN

https://github.com/mindslab-ai/univnet

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/brentspell/hifi-gan-bwe

https://github.com/mozilla/TTS

https://github.com/OlaWod/FreeVC

SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

AdaSpeech: Adaptive Text to Speech for Custom Voice

Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

Speaker normalization (GRL) for self-supervised speech emotion recognition

Method of Preventing Timbre Leakage Based on Data Perturbation

https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py

https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py

https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py

https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py

https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py

About

Core Engine of Singing Voice Conversion & Singing Voice Clone
