Home
Welcome to the so-vits-svc-fork wiki!
Windows:

```shell
py -3.10 -m venv venv
venv\Scripts\activate
```

Linux/MacOS:

```shell
python3.10 -m venv venv
source venv/bin/activate
```

Anaconda:

```shell
conda create -n so-vits-svc-fork python=3.10 pip
conda activate so-vits-svc-fork
```

Installing without creating a virtual environment may cause a `PermissionError` if Python is installed in Program Files, etc.
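After activating the environment, install the package inside it with pip (`-U` upgrades any existing install):

```shell
pip install -U so-vits-svc-fork
```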
Based on VITS, the decoder was changed to NSF-HiFiGAN, the input was changed to ContentVec, pitch prediction by a pitch decoder and clustering of ContentVec features during inference were added to reduce speaker-information leakage from the input speech, and a Conv1d layer was added in front of the original network.
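As a rough illustration of the added input path, here is a minimal, hypothetical sketch (not the actual so-vits-svc-fork code): 768-dimensional ContentVec features pass through an extra Conv1d before the original VITS stack. The kernel size is an arbitrary assumption; the channel sizes follow the `ssl_dim` discussion below.

```python
import torch
import torch.nn as nn

class ContentVecPreNet(nn.Module):
    """Conceptual sketch of a Conv1d layer added in front of the original network."""

    def __init__(self, ssl_dim: int = 768, hidden_channels: int = 192):
        super().__init__()
        # kernel_size=5 is an assumption for illustration, not the fork's value
        self.pre = nn.Conv1d(ssl_dim, hidden_channels, kernel_size=5, padding=2)

    def forward(self, contentvec: torch.Tensor) -> torch.Tensor:
        # contentvec: (batch, ssl_dim, frames) -> (batch, hidden_channels, frames)
        return self.pre(contentvec)

feats = torch.randn(1, 768, 100)        # dummy ContentVec features
print(ContentVecPreNet()(feats).shape)  # torch.Size([1, 192, 100])
```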
This repository is becoming more outdated day by day. It is a good idea to try your hand at other projects as well:

- liujing04/Retrieval-based-Voice-Conversion-WebUI, ddPn08/rvc-webui: Hyperparameters and model size are almost the same, but additional features have been implemented in inference to reduce speaker-information leakage. Some claim that training is considerably faster than so-vits-svc, but this is doubtful; conversely, if the quality seems inferior, you may simply be lacking training.
- PlayVoice/so-vits-svc-5.0: Again, all parts (decoder, feature extractor, etc.) were changed.
- PlayVoice/lora-svc: Similar to the repository above, but using LoRA.
- yxlllc/DDSP-SVC: High-speed audio conversion with a DDSP vocoder.
- fishaudio/fish-diffusion: HiFiSinger and DiffSinger are elaborately implemented.
We have tried to implement some of these, but it was too difficult (#340).
If you are thinking of creating a speech-synthesis repository, please take a look at this repository and fish-diffusion's PyTorch Lightning implementation to write beautiful code that increases reusability and makes everyone happy. Also, if you like this template, see 34j/pypackage-template-fork.
Most parameters have not been changed from those in VITS and seem to be "empirical" (or simply left alone).
Changing the `train` parameters basically does not require resetting the model; changing the learning rate, etc. may improve training speed. The `segment_size` parameter is the length of the audio tensor passed to the decoder; increasing it may speed up decoder training, but may not make much difference because VRAM usage also increases.
Changing the `model` parameters may reset some or all of the weights. The current model may or may not be too large for a single speaker; simply reducing the number of channels does not seem to be effective. Changing the decoder to `MS-iSTFT-Generator` etc. seems to double the inference speed.
`ssl_dim` is the number of input channels; the officially trained ContentVec model outputs 768 channels, but after applying `final_proj` this becomes 256.
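As a minimal sketch of editing these settings, the snippet below tweaks `train`-side keys in the experiment config; the path and key names follow the upstream so-vits-svc `config.json` layout and may differ in your setup.

```python
import json

# Hypothetical path; adjust to wherever your experiment's config.json lives.
CONFIG_PATH = "configs/44k/config.json"

with open(CONFIG_PATH) as f:
    config = json.load(f)

# "train" keys can usually be changed without resetting the model.
config["train"]["learning_rate"] = 1e-4
config["train"]["segment_size"] = 10240  # longer segments: faster decoder training, more VRAM

# "model" keys (e.g. ssl_dim, channel counts) may reset some or all weights
# if changed, so they are left untouched here.
# config["model"]["ssl_dim"] = 768

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```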
I lack understanding of this model and need to study more of the basics... I shouldn't waste my time at the computer working on such unimportant bugs...
Note that these trained models differ in some training hyperparameters, so this is not an exact comparison. Training was conducted from the same initial weights; weights whose sizes did not match were reset. The dataset contains 150 audio files.
- NSF-HiFiGAN Generator, ~1.3?k epochs, 192/768 channels: individualAudio.4.mp4
- MS-iSTFT Generator, 29.4k epochs, 192/768 channels (somehow has noise): individualAudio.2.mp4
- NSF-HiFiGAN Generator, 17.4k epochs, 32/64 channels: individualAudio.1.mp4
| Name | fp32 (TFLOPS) | fp16 (TFLOPS) | VRAM (GB) |
|---|---|---|---|
| T4 | 8.141 | 65.13 | 16 |
| P100 | 9.626 | 19.05 | 16 |
| P4000 | 5.304 | 0.08288 | 8 |
| P5000 | 8.873 | 0.1386 | 16 |
| P6000 | 12.63 | 0.1974 | 24 |
| V100 PCIe | 14.13 | 28.26 | 16/32 |
| V100 SXM2 | 15.67 | 31.33 | 16/32 |
| RTX4000 | 7.119 | 14.24 | 8 |
| RTX5000 | 11.15 | 22.3 | 16 |
| A4000 | 19.17 | 19.17 | 16 |
| A5000 | 27.77 | 27.77 | 24 |
| A6000 | 38.71 | 38.71 | 48 |
| A100 PCIe/SXM4 | 19.49 | 77.97 | 40/80 |
| RTX 3090 | 35.58 | 35.58 | 24 |
| RTX 4090 | 82.58 | 82.58 | 24 |
| RX 7900 XTX | 61.42 | 122.8 | 24 |
wookayin/gpustat: 📊 A simple command-line utility for querying and monitoring GPU status.
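gpustat can be installed from PyPI to watch GPU utilization and VRAM usage during training; the watch flag below is taken from its README, so check `gpustat --help` if it differs in your version:

```shell
pip install gpustat
gpustat -i  # refresh continuously; omit -i for a one-shot report
```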
Debugging realtime inference on YouTube, etc. with RTX Voice:

- Windows: `Speakers (NVIDIA RTX Voice)`
- RTX Voice Input: `(Default Device)`
- RTX Voice Output: `CABLE Input (VB-Audio Virtual Cable)`
- so-vits-svc-fork GUI Input: `CABLE Output (VB-Audio Virtual Cable)`
- so-vits-svc-fork GUI Output: (your device)