
Bad quality of generated speech after training #5

Open
SolomidHero opened this issue Jan 29, 2021 · 5 comments

Comments

@SolomidHero

Hello! I did some preprocessing to extract features from the wavs in the dataset for training EA-SVC. Specifically, I use the following features:

  • PPG from hidden state of model trained on TIMIT dataset (768 dim)
  • f0 with WORLD by direct use of pyworld (1 dim, zeros in f0 are not processed)
  • speaker embeddings extracted with pyannote.audio

I tried training the first two stages (i.e. without adversarial generator training, then with it) on both LibriSpeech dev-clean and NUS-48E singing. The disentanglement loss wasn't used in these experiments. In the 1st stage, loss_g (g_mag + g_sc) is about 1.0; in the 2nd, loss_g increases to 5.0 (g_mag + g_sc + g_adv + g_feat) and loss_d is about 3.0e-01 (d_real + d_fake). The model wasn't trained for the 3rd stage. Results are much the same in both dataset experiments.

Because the generated audio is not good at either stage, I wonder whether I made a mistake somewhere in the training process. I hope the loss values above give you a better view of the situation.
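For reference, the two reconstruction terms named above (g_sc and g_mag) are usually the spectral-convergence and log-magnitude parts of a multi-resolution STFT loss. A minimal NumPy sketch of one resolution follows; the FFT size, hop size, and epsilon are illustrative assumptions, not values from this repo:

```python
# Hedged sketch of the g_sc / g_mag terms of an STFT loss (one resolution).
# n_fft, hop, and the 1e-7 floor are assumptions for illustration only.
import numpy as np

def stft_mag(x: np.ndarray, n_fft: int = 1024, hop: int = 256) -> np.ndarray:
    """Magnitude spectrogram via a Hann-windowed framed FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def stft_loss(y: np.ndarray, y_hat: np.ndarray) -> tuple:
    """Return (g_sc, g_mag) between a reference and a generated waveform."""
    S, S_hat = stft_mag(y), stft_mag(y_hat)
    g_sc = np.linalg.norm(S - S_hat) / np.linalg.norm(S)               # spectral convergence
    g_mag = np.mean(np.abs(np.log(S + 1e-7) - np.log(S_hat + 1e-7)))  # log-magnitude L1
    return float(g_sc), float(g_mag)
```

Identical waveforms give (0.0, 0.0); in practice several (n_fft, hop) pairs are summed.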

P.S. The stage numbers refer to these parameters in the config:

  1. "adv_ag": false, "adv_fd": false
  2. "adv_ag": true, "adv_fd": false
  3. "adv_ag": true, "adv_fd": true
@SolomidHero SolomidHero changed the title Bad quality result after training on LibriSpeech Bad quality result after training Jan 29, 2021
@SolomidHero SolomidHero changed the title Bad quality result after training Bad quality of generated speech after training Jan 29, 2021
@SolomidHero
Author

SolomidHero commented Feb 1, 2021

Should I preprocess the f0 features with high- and low-frequency cutoffs, linear interpolation over zero-valued segments, and normalization?
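A minimal sketch of two of those steps (interpolating across unvoiced zero frames, then z-normalizing log-f0). This is purely illustrative; the function names and epsilon values are assumptions, not code from this repo:

```python
# Hedged sketch: fill unvoiced (zero) f0 frames by linear interpolation,
# then z-normalize in the log domain. Epsilons are illustrative assumptions.
import numpy as np

def interp_f0(f0: np.ndarray) -> np.ndarray:
    """Fill zero (unvoiced) frames by linear interpolation over voiced ones."""
    voiced = f0 > 0
    if not voiced.any():
        return f0.copy()
    idx = np.arange(len(f0))
    # np.interp clamps to the edge voiced values outside the voiced range
    return np.interp(idx, idx[voiced], f0[voiced])

def normalize_log_f0(f0: np.ndarray) -> np.ndarray:
    """Z-normalize log-f0 using statistics of the interpolated contour."""
    lf0 = np.log(interp_f0(f0) + 1e-8)
    return (lf0 - lf0.mean()) / (lf0.std() + 1e-8)
```

Frequency cutoffs would be a separate filtering step on the waveform before f0 extraction.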

@980202006

Hi, which dataset did you use for training?

@SolomidHero
Author

I tried both LibriSpeech dev-clean and NUS-48E.

@leerumor

@SolomidHero hi, have you solved this problem? I also get bad quality: my STFT loss doesn't converge (it stays near 3), and the model can't even fit a single song...

@SolomidHero
Author

@leerumor, hi.
Sorry for the late answer. I found that wav2wav conversion is a hard task to learn. It needs many epochs of training and a large enough dataset (50+ hours at minimum), and if a GAN model is used it might still not converge :(

As for this repository specifically, I couldn't train a good model on the datasets above and moved on.
