- 💗This project is aimed at deep learning beginners; basic familiarity with Python and PyTorch is a prerequisite for using it;
- 💗This project aims to help deep learning beginners move beyond dry, purely theoretical study and master the fundamentals of deep learning through practice;
- 💗This project does not support real-time voice conversion (supporting it would require replacing whisper);
- 💗This project will not develop one-click packages for other purposes;
- A GPU with 6 GB of memory is enough for training
- Supports multiple speakers
- Creates unique speakers through speaker mixing
- Can convert audio even with light accompaniment
- F0 can be edited using Excel
https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/hifigan_release
- sovits5.0_main_1500.pth: the model includes generator + discriminator = 176M and can be used as a pre-trained model
- speaker files are in the configs/singers directory and can be used for inference tests, especially for checking timbre leakage
- speakers 22, 30, 47, and 51 are highly recognizable; their training audio samples are in the configs/singers_sample directory
Feature | From | Status | Function | Remarks |
---|---|---|---|---|
whisper | OpenAI | ✅ | strong noise immunity | - |
bigvgan | NVIDIA | ✅ | alias and snake | Uses slightly more GPU memory and has been removed from the main branch; switch to the bigvgan branch for clearer formants and noticeably better sound quality |
natural speech | Microsoft | ✅ | reduce mispronunciation | - |
neural source-filter | NII | ✅ | solve the problem of audio F0 discontinuity | - |
speaker encoder | - | ✅ | Timbre Encoding and Clustering | - |
GRL for speaker | Ubisoft | ✅ | Preventing Encoder Leakage Timbre | - |
one shot vits | Samsung | ✅ | Voice Clone | - |
SCLN | Microsoft | ✅ | Improve Clone | - |
PPG perturbation | this project | ✅ | Improved noise immunity and de-timbre | - |
VAE perturbation | this project | ✅ | Improve sound quality | - |
💗Because data perturbation is used, training takes longer than in other projects.
Necessary pre-processing:
- 1 accompaniment separation
- 2 band extension
- 3 sound quality improvement
- 4 cut the audio into clips shorter than 30 seconds, as whisper requires💗

Then put the dataset into the dataset_raw directory using the following file structure:
dataset_raw
├───speaker0
│ ├───000001.wav
│ ├───...
│ └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
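
Before running any preprocessing, it can help to confirm that dataset_raw follows the layout above and that every clip is under the 30-second limit. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it relies on Python's standard `wave` module, which only reads PCM WAV files.

```python
import wave
from pathlib import Path

MAX_SECONDS = 30.0  # clip-length limit stated above for whisper


def wav_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def find_problems(root="dataset_raw"):
    """List layout or length problems under root/<speaker>/<clip>.wav."""
    root = Path(root)
    if not root.is_dir():
        return [f"{root}: directory not found"]
    problems = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            problems.append(f"{entry}: expected a speaker directory")
            continue
        clips = sorted(entry.glob("*.wav"))
        if not clips:
            problems.append(f"{entry}: no .wav files")
        for clip in clips:
            seconds = wav_seconds(clip)
            if seconds >= MAX_SECONDS:
                problems.append(f"{clip}: {seconds:.1f} s, must be under 30 s")
    return problems
```

Calling `find_problems()` from the repository root returns an empty list when the dataset looks usable, and one message per issue otherwise.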
- 1 Install the software dependencies:
sudo apt update && sudo apt install ffmpeg
pip install -r requirements.txt
- 2 Download the timbre encoder, Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/
- 3 Download the whisper multilingual medium model (make sure the file is medium.pt) and put it into whisper_pretrain/
- 4 whisper is built in; do not install it separately, or it will conflict and raise an error
- 1, set the working directory:
export PYTHONPATH=$PWD
- 2, resample the audio
generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k:
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k:
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
- 3, use the 16 kHz audio to extract pitch; f0_ceil is set to 900 and should be adjusted to the highest pitch in your data
python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch
or, for low-quality audio, use the crepe-based script instead:
python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
- 4, use the 16 kHz audio to extract PPG
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
- 5, use the 16 kHz audio to extract the timbre code
python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
- 6, extract the average timbre code for inference; when generating the training index, it can also replace the per-clip timbre so that each speaker is trained with one unified timbre
python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
- 7, use the 32 kHz audio to extract the linear spectrum
python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
- 8, use the 32 kHz audio to generate the training index
python prepare/preprocess_train.py
- 9, debug the training files
python prepare/preprocess_zzz.py
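
The steps above have to run in order from the repository root, so a small driver script can save retyping them. This is only a sketch under that assumption: the command lines are copied from the steps above, while `build_commands` and `run_all` are hypothetical helper names that do not exist in the project. It uses preprocess_f0.py for pitch; swap in preprocess_f0_crepe.py for low-quality audio.

```python
import os
import shlex
import subprocess
import sys

# Script invocations from steps 2-9 above, in execution order.
STEPS = [
    "prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000",
    "prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000",
    "prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch",
    "prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper",
    "prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker",
    "prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer",
    "prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs",
    "prepare/preprocess_train.py",
    "prepare/preprocess_zzz.py",
]


def build_commands(python=sys.executable):
    """Turn each step into an argument list for subprocess."""
    return [[python] + shlex.split(step) for step in STEPS]


def run_all(cwd="."):
    """Run every step in order, stopping at the first failure."""
    # Step 1: PYTHONPATH must point at the repository root.
    env = dict(os.environ, PYTHONPATH=os.path.abspath(cwd))
    for command in build_commands():
        print("running:", " ".join(command))
        subprocess.run(command, cwd=cwd, env=env, check=True)
```

Because `check=True` raises on a non-zero exit code, a failing step stops the run instead of letting later steps work on incomplete data.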
data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
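
A quick way to catch a preprocessing step that silently skipped files is to check that every 16 kHz clip has its counterparts in the other directories. The sketch below assumes exactly the layout and file suffixes shown above; `missing_features` is a hypothetical helper, not a project script, and it checks only the directories that appear in the tree.

```python
from pathlib import Path

# Feature directory -> file suffix, as shown in the layout above.
FEATURES = {
    "waves-32k": ".wav",
    "pitch": ".pit.npy",
    "whisper": ".ppg.npy",
    "speaker": ".spk.npy",
}


def missing_features(root="data_svc"):
    """Return the expected feature files that do not exist yet."""
    root = Path(root)
    missing = []
    speakers = set()
    for wav in sorted((root / "waves-16k").glob("*/*.wav")):
        speaker, stem = wav.parent.name, wav.stem
        speakers.add(speaker)
        for folder, suffix in FEATURES.items():
            expected = root / folder / speaker / (stem + suffix)
            if not expected.is_file():
                missing.append(expected)
    # Each speaker also needs one averaged timbre file in singer/.
    for speaker in sorted(speakers):
        averaged = root / "singer" / (speaker + ".spk.npy")
        if not averaged.is_file():
            missing.append(averaged)
    return missing
```

An empty result means every clip has its 32 kHz audio, pitch, PPG, and timbre files, and every speaker has an averaged timbre file.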
- 0, to fine-tune from the pre-trained model, first download it: sovits5.0_main_1500.pth
set pretrain: "./sovits5.0_main_1500.pth" in configs/base.yaml, and adjust the learning rate appropriately, e.g. 1e-5
- 1, set the working directory:
export PYTHONPATH=$PWD
- 2, start training:
python svc_trainer.py -c configs/base.yaml -n sovits5.0
- 3, resume training:
python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth
- 4, view the log:
tensorboard --logdir logs/
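
When resuming, the checkpoint to pass to -p is usually the newest one. The following sketch picks it by modification time and builds the resume command from step 3. It assumes checkpoints are saved as .pth files under chkpt/sovits5.0, as the resume command suggests; both function names are hypothetical, and choosing by modification time is a convenience of this sketch rather than project behaviour.

```python
from pathlib import Path


def latest_checkpoint(folder="chkpt/sovits5.0", pattern="*.pth"):
    """Return the most recently modified checkpoint in folder, or None."""
    candidates = list(Path(folder).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def resume_command(checkpoint, config="configs/base.yaml", name="sovits5.0"):
    """Build the argument list for the resume-training command above."""
    return ["python", "svc_trainer.py", "-c", config, "-n", name,
            "-p", str(checkpoint)]
```

The returned list can be passed directly to `subprocess.run`, with PYTHONPATH set as in step 1.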
- 1, set the working directory:
export PYTHONPATH=$PWD
- 2, export the inference model: text encoder, Flow network, Decoder network
python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
- 3, use whisper to extract the content encoding separately, rather than in one-click inference, to reduce GPU memory usage
python whisper/inference.py -w test.wav -p test.ppg.npy
this generates test.ppg.npy; if no ppg file is specified in the next step, it is generated automatically
- 4, extract the F0 parameters to a CSV text file, open it in Excel, and manually correct any wrong F0 values with reference to Audition or Sonic Visualiser
python pitch/inference.py -w test.wav -p test.csv
- 5, specify the parameters and run inference
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv
when --ppg is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically;
when --pit is specified, the manually tuned F0 parameters are loaded; if it is not specified, F0 is extracted automatically;
the output file is written to the current directory: svc_out.wav
args | --config | --model | --spk | --wave | --ppg | --pit | --shift
---|---|---|---|---|---|---|---
name | config path | model path | speaker | wave input | wave ppg | wave pitch | pitch shift
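
Since --ppg, --pit, and --shift are optional, it can be convenient to assemble the inference command programmatically and add only the flags that are needed. This is a minimal sketch: the flags are the ones listed above, while `inference_command` and its defaults are hypothetical and not part of the project.

```python
def inference_command(wave, spk, model="sovits5.0.pth",
                      config="configs/base.yaml",
                      ppg=None, pit=None, shift=None):
    """Build the svc_inference.py argument list, omitting unset options."""
    command = ["python", "svc_inference.py",
               "--config", config,
               "--model", model,
               "--spk", spk,
               "--wave", wave]
    # Optional inputs: leave them out to let the script extract them itself.
    for flag, value in (("--ppg", ppg), ("--pit", pit), ("--shift", shift)):
        if value is not None:
            command += [flag, str(value)]
    return command
```

For example, `inference_command("test.wav", "./configs/singers/singer0001.npy", pit="test.csv")` reuses a hand-edited F0 file while leaving content extraction to the script.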
The name is a pure coincidence: average -> ave -> eva; eve (eva) represents conception and reproduction
python svc_eva.py
eva_conf = {
'./configs/singers/singer0022.npy': 0,
'./configs/singers/singer0030.npy': 0,
'./configs/singers/singer0047.npy': 0.5,
'./configs/singers/singer0051.npy': 0.5,
}
the generated singer file is: eva.spk.npy
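
Conceptually, the weights in eva_conf describe how much each speaker contributes to the mixed timbre. The sketch below illustrates that idea as a weighted sum of speaker vectors; this is an assumption about how the mixing works, not a copy of svc_eva.py, and `mix_speakers` is a hypothetical name. It operates on plain Python lists to stay dependency-free, whereas the real speaker files are .npy arrays that would be loaded with `numpy.load`.

```python
def mix_speakers(embeddings, weights):
    """Return the weighted sum of equal-length speaker vectors."""
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per speaker vector")
    size = len(embeddings[0])
    if any(len(vector) != size for vector in embeddings):
        raise ValueError("all speaker vectors must have the same length")
    mixed = [0.0] * size
    for vector, weight in zip(embeddings, weights):
        for index, value in enumerate(vector):
            mixed[index] += weight * value
    return mixed
```

With the weights shown above (0, 0, 0.5, 0.5), the result would be an even blend of singer0047 and singer0051, with no contribution from the other two.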
💗Both Flow and Decoder take a timbre as input, and you can even feed different timbre parameters to the two modules to create more unique timbres.
https://github.com/facebookresearch/speech-resynthesis
https://github.com/jaywalnut310/vits
https://github.com/openai/whisper/
https://github.com/NVIDIA/BigVGAN
https://github.com/mindslab-ai/univnet
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/brentspell/hifi-gan-bwe
https://github.com/mozilla/TTS
https://github.com/OlaWod/FreeVC
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
AdaSpeech: Adaptive Text to Speech for Custom Voice
Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
Speaker normalization (GRL) for self-supervised speech emotion recognition
https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py