Notes on Finetuning #81

Kreevoz · 2023-11-25T20:16:22Z

Kreevoz
Nov 25, 2023

I've made a few notes during finetuning runs and figure we could maybe pool our insights into one discussion to help everyone iterate efficiently. I don't claim these to be anything more than my own observations/compilation of useful notes. Take them with a grain of salt, especially since there is such rapid development happening as of writing this. I am not affiliated with the authors of this lovely TTS model.
Also take alook through closed issues if you're running into trouble. There is some useful information in them.

Teaching the model new features

Is possible.
If your dataset contains punctuation that was not present in the original dataset for the finetuned checkpoint (but is accepted by the espeak phonemizer and passed through), then the model will learn the new features relatively quickly if they have a good match in the audio files.
It is also possible to make the model forget unwanted features by never including them.

Text dataset quality

The phonemization+tokenization processing of StyleTTS2 distinguishes between opening and closing quote pairs and tags them differently in the tokenized phoneme transcription. The model picks up on these differences. But this does not work for stray quotes.
Preprocess the text carefully to maximize the amount of sentence punctuation that espeak carries through to the phonemized output. Ensure that punctuation matches pauses. This will make the model a lot more predictable and less likely to skip over punctuation.
The LibriTTS dataset has poor punctuation and a mismatch of spoken/unspoken pauses with the transcripts. This is a common oversight in many datasets.
Also it lacks variety of punctuation. In the field, you may encounter texts with creative use of dashes, pauses and combination of quotes and punctuation. LibriTTS lacks those cases. But the model can learn these!
Additionally, LibriTTS has stray quotes in some texts, or begins a sentence with a quote. These things reduce quality a little (or a lot, sometimes). You will want to filter those out.

Robustness

StyleTTS2 seems quite robust overall.
The model can be trained with audiofiles that have baked-in effects. They will be reproduced.

Artifacts

Artifacts seem to be - in part - the result of a too short max_len (in addition to poor audio cleanup and a low quality transcription of course)
At a max_len of 100 (=1.25 seconds), finetuning is possible, but the start and end of generated audio may accumulate distortion and pops.
At a max_len of 800 (=10 seconds), quality is excellent even after one epoch and improves on subsequent iterations. This length covers the majority of audio datasets (as you know the free standard datasets adhere to the duration limitations that autoregressive models like tacotron 1/2 established years ago - due to their attention-mechanism imploding after 10-12 seconds)
max_len of 400 and 600 also works well.
For small datasets (shorter than 1 hour) you can generally set this value much higher than on big datasets, since overall the VRAM usage is lower to start with.
Providing a clean reference audio file to compute the style helps a great deal in mitigating artifacts.
Consider adding 100ms of silence to the end of all audio files in your dataset, and add a stop-token to the end of all of your dataset sentences. More details further down in the comment discussions. This works really well and can massively reduce artifacts at the end of long generated sentences.
Refer to the repo author's comment below if you wish to use $ as a stoptoken.

Finetune training Stages

Base

You can train with both Style Diffusion and SLM Adversarial Training disabled.
If your dataset is relatively normal (read: boring, human, similar to the LibriTTS voices), skipping finetuning style diffusion can work, but won't deliver all that this model architecture is capable of.

Style Diffusion

config parameter name: diff_epoch
This parameter starts counting at 0. For example to start diffusion training on epoch 5, set this parameter to (5-1) 4
You can disable style diffusion training by setting diff_epoch to a value that is larger than your total number of epochs.
For large datasets, having it start on the second epoch can work.
For smaller datasets, start it on a later epoch as you will need to iterate through many epochs anyway. Look at the defaults of the finetune config file that ships with this repo.
Not finetuning this stage saves some VRAM at the expense of worse inference quality.

SLM Adversarial Training

config parameter name: joint_epoch
This parameter starts counting at 0. For example to start SLM adversarial training on epoch 10, set this parameter to (10-1) 9
joint_epoch must be set to a higher number than diff_epoch, or you will encounter an error. You cannot run SLM Adversarial Training before you begin running Style Diffusion training.
You can disable SLM adversarial training by setting joint_epoch to a value that is larger than your total number of epochs.
Additional noteworthy parameter: batch_percentage
Defaults to 0.5. Adds (batch_size * 0.5) number of batches with SLM Adversarial samples.
Calculate your batch size accordingly with some spare VRAM if you plan to make use of it.
(For example the previous batch size was 6 without SLM adv. training, the new batch size is 4 (since 4*0.5 =) 2 batches will get added, totaling 6 again)
SLM Adv training is (moderately) heavy on computational resources and VRAM.
If you run out of memory, lower your batchsize by 2 and resume finetuning from your last saved checkpoint. Do not lower your batchsize below 2.
Make sure to save your checkpoints at the right intervals so you do not lose progress.
You can also be cheeky and finish a finetuning run with SLM Adv. training disabled, and then resume finetuning from your final checkpoint, with SLM enabled, and adjusted batch_size, for a few extra epochs.
Not finetuning this stage lowers computational cost and lowers inference quality.
You can reduce the VRAM usage of this stage by adjusting the min_len and max_len under the slmadv_params section in the config file within reason.

Errors, Crashes

RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size

One or both of the following conditions are present:

A max_len of less than 100
Audiofiles that are significantly shorter than 1 second.
The fix is very simple though: Remove short <1s audiofiles or merge them into longer files with merged transcripts. Ensure that max_len is at least 100.

Codepage error under Windows ( UnicodeDecodeError: 'charmap' codec can't decode byte .. )
Check the Operating Systems section below.

RuntimeError: The expanded size of the tensor (SOME NUMBER HERE) must match the existing size (512) at non-singleton dimension 1.

The input text is too long. If this is happening during training, check your dataset and split up extremely long sentences into more manageable ones. Make sure that if you use a custom OOD text, you split sentences on punctuation and ensure they don't become entire paragraphs. Anything that would take you longer than 10 seconds to speak is probably a candidate for splitting in half.

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

batch_size must be 2 or greater, or you will run into this error.

UnboundLocalError: local variable 'ref' referenced before assignment

If this appears when your finetuning is trying to begin the SLM Adversarial Training, then your diff_epoch is set to a later epoch than joint_training. for example: diff_epoch = 5 , joint_training = 4 is not valid. You would want joint_training to be the bigger number.

RuntimeError: Given groups=1, weight of size [1, 1, 3], expected input[1, 221, 1] to have 1 channels, but got 221 channels instead

If you are running finetuning across multiple GPUs, your chosen batch_size may be too small and result in each GPU only getting a batch of 1. Increase the batch_size.

Mixed-precision Training

If you don't want to run training in full precision, you can now run finetuning at mixed-precision.

accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Path/To/Your/config_ft.yml
Mixed-precision training only works with a single GPU:
Multi-GPU accelerate finetuning/ second-stage training is currently bugged. Check the main repo page or issue 7 for more information if you think you can help fix that.
You can expect minor ( ~10% ) savings in VRAM and minor ( ~5% ) speed improvements if you run it in mixed precision.
Since this means you can bump up the max_len a little bit, you essentially get a small quality improvement for free.

VRAM Usage

Using the default config_ft.yml finetuning config as base, with a 5h50min dataset, training for 5 epochs.
With disabled Style Diffusion and SLM Adversarial training:
batch_size: 4 , max_len: 100 = ~22GB VRAM. Fits onto a 4090 without problems. Training took ~3 hours.
batch_size: 6, max_len: 800 = ~74GB VRAM. Fits onto an A100 without trouble. Training took ~2 hours.
With enabled Style Diffusion (epoch 2+) and disabled SLM Adversarial training:
batch_size: 4 , max_len: 100 = ~23.1GB VRAM. Fits onto a 4090. Training took <4 hours.
batch_size: 4 , max_len: 100 , using accelerate mixed_precision=fp16 = ~21GB VRAM. 4-5% speed boost.
With enabled Style Diffusion (epoch 2+) and SLM Adversarial training (epoch 4+):
batch_size: 4 , max_len: 100 = ~28GB VRAM. Impossible on a 24GB card at this batch-size.
batch_size: 4 , max_len: 100 , using accelerate mixed_precision=fp16 = ~26.6GB VRAM. Still not feasible.
batch_size: 2 , max_len: 175 , using accelerate mixed_precision=fp16 = <19GB VRAM. Fits onto a 4090. Only ran this for the Joint Training epochs.
batch_size: 4, max_len: 800 = ~76.5GB VRAM. Fits onto an A100 without trouble. Training took <3 hours.

VRAM Usage Strategies

You can be smart about VRAM usage by interrupting finetuning and resuming with other parameters:
First, run the training with Style Diffusion finetuning enabled. Set parameters that make good use of your available VRAM.
Stop the training before you reach the epoch at which SLM Adversarial Training begins. Make sure a checkpoint is saved.
Lower the batch size by half, and then resume finetuning from your last saved checkpoint for the number of epochs you intended to run with SLM Adversarial Training. Now it should fit into VRAM whereas before it did not.
That way you don't have to touch the max_len and can benefit from a relatively speedy initial training run, and suffer through less epochs with reduced batch_size to finish training.
You can half the batch_size and double the max_len to stay roughly within the same amount of utilized VRAM, if you keep all other parameters the same. VRAM usage grows a little on subsequent epochs, so keep some spare capacity for long runs.
Halving the batch_size will roughly double the time it takes to run an epoch, but won't negatively impact quality.
Halving the max_len will negatively impact quality. If you can, reduce the batch_size instead.
If you have access to a GPU with 48 or 80GB of VRAM, just set reasonable parameters from the get-go and let it finish.
Do not go below max_len: 100 or batch_size: 2 ever.
If you are running the training/finetuning across several GPUs, you must set a batch_size that provides each GPU with at least a batch of size 2. (For example if you have 4 GPUs, then 8 is the minimal possible batch_size )

Checkpoints

No, you're not missing a checkpoint. They start their numbering at 0.
So epoch_2nd_00000.pth is your completed first epoch, and not an empty checkpoint.
If you terminate training before the first epoch completes, you will be left with no checkpoint and just a logfile.
The checkpoint size is around 1.89GB without Style Diffusion + SLM Adv.
The checkpoint size is around 2.1GB with Style Diffusion + SLM Adv.

Resuming Finetuning

Set these parameters:
pretrained_model: "Models/YourModelName/epoch_2nd_00123.pth"
load_only_params: false
This will continue from the given model checkpoint, and retain the optimizer settings.
There is no quality penalty for resuming like this. (I'm glad we're past those years of training TTS models, ahah)
Be sure to set the total number of epochs to be larger than the checkpoint epoch, or training will conclude before doing anything.

Logging

Adjust the config parameter log_dir: to point to a folder of your choice.
Give the folder a unique name for each run if you want to keep an overview, ie: "Models/MyCoolTTSModel"
Logs and checkpoints will be saved there, so ensure it is on a drive with enough space.
You can point a tensorboard at that folder to look at graphs.
Here is an example tensorboard screenshot for a 5 epochs run: https://imgur.com/7FN1zLQ

Operating system support

Finetuning on Linux works as-is.
Finetuning on Windows is possible, but you must set this enviroment variable: PYTHONUTF8=1 either system-wide, or in the terminal session you're using, before invoking the finetuning script.
You can set the variable in CMD like this: set PYTHONUTF8=1
You can set the variable in a PowerShell like this: $Env:PYTHONUTF8 = 1
You can verify that Python is using UTF-8 mode by entering a python shell and using these commands:
import sys
print(sys.flags.utf8_mode)
It will print 1 if it is enabled.
This will switch the file loading/saving operations to use UTF-8, otherwise you'll get an error about an unsupported codepage.
Also you will want to specify paths using / forward slashes, rather than the backward \ slash notation common for windows just to be on the safe side.
Mixed-precision works under Windows as well.

Hardware Requirements

CPU performance matters since a lot of data is shuffled around. 16 CPU cores are a good baseline.
Trying to run this on 8 CPU cores on a fast GPU may bottleneck the process.
Regular system RAM usage is not very high. If you have 16-ish GB system RAM, you should be fine unless your dataset is truly huge.
You cannot have enough VRAM for this. More is better. More VRAM means bigger max_len and batch_size.

Quality comparisons

Dataset: Custom dataset for Garrus from Mass Effect. 30 emotions/styles tagged as speaker IDs, total duration about 5h50min. Custom text preprocessing. Audio includes flanger, this is not a model error but quite desired for a Turian voice. Using a custom OutOfDomain text dataset for SLM AT.
Epoch: 5 , for all examples. There is still room for improvement with more epochs.
Sampling: alpha=0.3, beta=0.7, diffusion_steps=10, embedding_scale=1

Text:
You are reading a discussion page on Github, imagine that! I think the human saying is: "Git good!" Wonder why they didn't choose "that" name.

(The "quoted" words are used for extra emphasis in my dataset)

Phoneme version:
juː ɑːɹ ɹˈiːdɪŋ ɐ dɪskˈʌʃən pˈeɪdʒ ˌɔn ɡˈɪthʌb , ɪmˈædʒɪn ðˈæt ! ˈaɪ θˈɪŋk ðə hjˈuːmən sˈeɪɪŋ ɪz : `` ɡˈɪt ɡˈʊd '' ! wˈʌndɚ wˌaɪ ðeɪ dˈɪdnt tʃˈuːz `` ðˈæt '' nˈeɪm .

batch_size: 4 , max_len: 100, without style diffusion finetuning, without slm adversarial finetuning :
https://voca.ro/11DDidEhJac5

batch_size: 6, max_len: 800, without style diffusion finetuning, without slm adversarial finetuning :
https://voca.ro/18PaQ8F248Hu

batch_size: 4 , max_len: 100, with style diffusion finetuning, without slm adversarial finetuning :
https://voca.ro/1bulPARTI2mn

batch_size: 2 , max_len: 175, with style diffusion finetuning, with slm adversarial finetuning :
https://voca.ro/1aqmfuqHS51N
Since running in batchsize 2 takes forever, I only trained for 2 final epochs with slm adversarial finetuning, and prior to that, up to epoch 4 with batchsize of 4 with 100 max_len.

batch_size: 4, max_len: 800, with style diffusion finetuning, with slm adversarial finetuning :
https://voca.ro/11QGDDMhWsNU

These quality examples don't reflect the maximum quality possible and are just for illustration purposes. :>

Hope this is useful.

yl4579 · 2023-11-25T20:26:52Z

yl4579
Nov 25, 2023
Maintainer

Thanks for your detailed notes! I have included it in README. But does your RAM usage include the SLM adversarial training run?

10 replies

Kreevoz Nov 25, 2023
Author

So intriguing though. I'll give it another spin and have that SLM training kick in after an epoch or two. Always good to have more comparison data.

yl4579 Nov 25, 2023
Maintainer

It could also be because it is not that important in general. In the ablation study we found that removing style diffusion (i.e., setting alpha = 0 and beta = 0) results in the highest CMOS drop. SLM AT does help but it's not as important as style diffusion. I think you sample is a little unnatural probably because you set both of them to 0.

But even the difference between your samples with different max_len is not that obvious. I guess each of these changes could result in slight degradation (like small max_len, skip after joint_epoch etc.) and all together you get poor results #49 reported. None of them is very significant by itself.

Kreevoz Nov 25, 2023
Author

Yeah, I feel like I'm unfairly representing the model here.... you raise very fair points. I'll try experimenting with letting style diffusion and/or SLM training kick in sooner to see how that changes things.

Kreevoz Nov 26, 2023
Author

And updated! You were right, actually, both style diffusion and SLM AT together make a BIG difference. 😵 Probably should have let those things run for more than 5 epochs, but for a quick example it suffices.

The numbering of the epochs is a bit strange, the parameters for joint_epoch , diff_epoch and the checkpoint names count from 0, but the epochs themselves count from 1. Mostly a cosmetic issue, though. Doesn't feel like that deserves an issue ticket.

yl4579 Nov 26, 2023
Maintainer

I have a feeling that the emotions can be further improved if you train more epochs and set multispeaker to false and do not load diffusion model for the initialization, so you don’t need to specify the reference and set alpha or beta anymore.

martinambrus · 2023-11-26T10:00:11Z

martinambrus
Nov 26, 2023

I have question about the following point:

The espeak phonemizer distinguishes between opening and closing quote pairs and tags them differently in the phoneme transcription. The model picks up on these differences. But this does not work for stray quotes.

I tried this quickly on Colab by installing the phenomizer using this how-to and used it on the following 3 lines of text:

!echo "“Make sure they don't win.”" | espeak-phonemizer -v en-us
!echo "Make sure they don't win." | espeak-phonemizer -v en-us
!echo "\"Make sure they don't win.\"" | espeak-phonemizer -v en-us

However, the output of all 3 lines was the same:

mˌeɪk ʃˈʊɹ ðeɪ dˈoʊnt wˈɪn 
mˌeɪk ʃˈʊɹ ðeɪ dˈoʊnt wˈɪn
mˌeɪk ʃˈʊɹ ðeɪ dˈoʊnt wˈɪn

Did I miss something?

I used epitran to transliterate before and the output from it was a little bit different. Also, epitran actually left those quotes intact during its own phenomization:

“mejk ʃʊɹ ðej dɑnt wɪn.”

I'm also wondering how much of a difference there is between epitran's and espeak's output? English is not my primary language and I have very little knowledge of phonemization, so I can't tell what the difference between these 2 outputs really is. I'd be glad if anyone could shed a bit of light into this for me, too :)

3 replies

Kreevoz Nov 26, 2023
Author

Yeah. The phonemizer is a bit particular about which quotes it likes.
If you replace any sort of fancy unicode quotes ” with the bog-standard regular quotes " in a text cleaning step prior to sending the text to the phonemization function, it'll accept them.

Also you didn't get them output because you forgot to tell espeak to retain the punctuation with your test-command. Try using phonemizer which can call espeak for you.
!echo "“Make sure they don't win.”" | phonemize -b espeak -l en-us --preserve-punctuation --with-stress

The difference between the two outputs is that espeak adds stress markers to the words to help the model. Those ˌ ˈ ticks indicate primary stress. The pretrained models for StarTTS2 are using stress markers.

Kreevoz Nov 26, 2023
Author

Here. Make a colab/jupyter notebook and place this into a cell:

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True,  with_stress=True)

def text_to_phonemes(text):
  text = text.strip()
  print("Text before phonemization: ", text)
  ps = global_phonemizer.phonemize([text])
  print("Text after phonemization: ", ps)
  ps = word_tokenize(ps[0])
  ps = ' '.join(ps)
  print("Final text after tokenization: ", ps)
  return ps

text = '''Your "fancy text" goes here!''' ## EDIT THIS
text_to_phonemes(text)

This will give you the exact same end result that StyleTTS2 expects to see.

martinambrus Nov 26, 2023

Thank you, this really helped. Much appreciated :)

RillmentGames · 2023-11-29T11:23:20Z

RillmentGames
Nov 29, 2023

Thank you for the detailed write up!

It would be nice to get the default max_len 400 with everything turned on to train on a 24Gb card. Any idea if Batch_size=1 would work with gradient accumulation?
I am trying it with 16bits using accelerate launch --mixed_precision=fp16 in both first and second stage,
it helps with vram. Im on epoch 40/100 in second stage now and 16bit seems to work but LJ sounds monotonic in the 'Pred' audio samples of Tensorboard compared to the 'eval' of the 1st stage, joint training starts at epoch 50 so I'm waiting for that to make a conclusive judgement.

11 replies

RillmentGames Nov 30, 2023

@yl4579 Thank you for this amazing work!
For training, Would it be a good idea to put every utterance in its own speaker ID (even though some speaker ID's would map to the same voice) to mitigate the effect of roleplay in LibriTTS?
For example, train-clean-100\103\1241 has 4 roles with slightly different voices, the narrator, the hero, the heroine and the station clerk. It seems difficult to reliably detect the role from the audio data. I thought about maybe using an LLM to detect it from the context but I see the context is discontinuous (missing paragraphs) so the LLM may also get confused.

yl4579 Nov 30, 2023
Maintainer

@RillmentGames Unless you set a fixed number of speaker ids, the model is trained on a self-supervised way so it is not necessary. The speaker id is only used for the diffusion model training and the accuracy is not very good to be honest. So the conclusion is it doesn’t matter that much. If every sentence is its own speaker id, the diffusion model is trained to reconstruct the style instead and the variation will be low.

RillmentGames Dec 1, 2023

@Kreevoz To follow-up, I tried the gradient checkpointing but no luck. It seems the standard Huggingface Transformer package Trainer class is needed but StyleTTS2 doesnt use it. So I tried sticking in
import torch.utils.checkpoint as checkpoint
x=checkpoint(g(x),x)
wherever I could but it turned out to make training crash in almost all cases. The only time I got it to work was by putting x = checkpoint(l(x),x) in here but it looks like it saved almost no vram, and then I still have to test if it trains correctly while using it ☹️

So next I tried batch 1 but got an error ending with: File "F:\Experiment\StyleTTS2\Modules\diffusion\modules.py", line 389, in run x = torch.cat([x.expand(-1, embedding.size(1), -1), embedding], axis=-1) RuntimeError: The expanded size of the tensor (67) must match the existing size (256) at non-singleton dimension 1. Target sizes: [-1, 67, -1]. Tensor sizes: [256, 256, 1] Traceback (most recent call last): File "C:\Users\rill\.pyenv\pyenv-win\versions\3.10.5\lib\runpy.py", line 196, in _run_module_as_main I think something in the random mask code of the diffusion module is depends on the batch size.

Only positive I can report is that the enhanced LJ dataset seems to work in joint training as well, audio quality is improving now so LibriTTS-R should also work, but I couldnt find a way to do it on 24GB without lowering max_len ☹️

Kreevoz Dec 2, 2023
Author

@RillmentGames mmh. yeah batch_size of 2 is the absolute minimum. I just rent GPU pods in the cloud to do finetuning currently. Finetuning it locally would be my preference too... Wonder if the mixed precision could be further lowered. 🤔
Oh it does great on all sorts of datasets, yeah! To my delight, StarTTS2 is a lot more robust for datasets that include a lot of screaming and angry talk. xD I am tempted to look into the suggestion yl4579 made in regards to swapping out the style encoder since I don't really need zero-shot features with unknown speakers.

GUUser91 May 20, 2024

@RillmentGames
The vokan model is trained on LibriTTS-R.
https://huggingface.co/ShoukanLabs/Vokan

devidw · 2023-12-01T17:59:39Z

devidw
Dec 1, 2023

Thanks so much for the super useful notes @Kreevoz - much appreciated!

We are currently dealing with artifacts at the end of generated audio.

Artifacts seem to be - in part - the result of a too short max_len (in addition to poor audio cleanup and a low quality transcription of course)
At a max_len of 100 (=1.25 seconds), finetuning is possible, but the start and end of generated audio may accumulate distortion and pops.
I have not tried 200 yet. The default of 400 seems sensibly chosen.
At a max_len of 800 (=10 seconds), quality is excellent even after one epoch and improves on subsequent iterations. This length covers the majority of audio datasets (as you know the free standard datasets adhere to the duration limitations that autoregressive models like tacotron 1/2 established years ago - due to their attention-mechanism imploding after 10-12 seconds)
Providing a clean reference audio file to compute the style helps a great deal in mitigating artifacts.

We did a single speaker fine-tune with max_len=100 and got artifacts at the end: https://voca.ro/1fxqUN6Tj2US (~4s)

We were able to get rid of those with max_len=400: https://voca.ro/193XDB1sOpaU (~4s)

However, for longer audio at ~10s we still get them: https://voca.ro/1odqL5a7BC89 (~10s)

Wondering if this would be fixed if we fine-tune on max_len=800, but then wondering if the issue will pop up again for even longer audio like ~15s

Would that be something to expect based on how the model learns to produce the audio within the max_len window, or is there a sweet spot to get rid of most of the artifacts no matter the length of the audio?

Thx 🙏

18 replies

Kreevoz Dec 21, 2023
Author

Ok. This is surprising! I've linked to yl4579's comment in the notes text above so interested readers can make their own decision what to use.
Seeing how adding one's own token and silence duration doesn't break anything, I'll stick to my redundant secondary implementation xD I'd guaranteed forget that the dollar sign has two functions otherwise.

GUUser91 Jan 20, 2024

I want to give a tip for those that want to add a batch of 100ms silence in the audio files. Use audacity and use trim extend via the macro manager. Before you execute the macro manager, be sure to go to Edit > Preferences > Audio Settings and set the project sample rate to 24000.

Latrolage Mar 28, 2024

Honestly, I just realized that I have already done this. The $ is the token I used as and , which corresponds to 5000 samples of silence at the beginning and ending of a speech sample, so if you do wav[..., 5000:-5000], it will automatically remove the silence. It works for both base model and finetuning. There is no need to add another EOS token.

So should I be adding the $ to the start and end of my dataset transcription before or after running the phonemizer? Are you saying this code automatically adds samples of silence if I use $ or should I add it myself?

I'm not sure what I should be doing as a user.

Sweetapocalyps3 Apr 2, 2024

Hi guys, just a quick question. When you say to add a token to the end of the sentences, you mean the original sentences from where the train_list.txt phonemized file will be created or you mean to put '...' directly in the train_list.txt and val_list.txt files?

Thank you!

qwertyflagstop Apr 5, 2024

@Kreevoz I might have an explanation as to why the built in pad tokens $ + 5000 samples of padding doesn't work for the LibriTTS base model and your fix works. Assuming the LibriTTS dataset has the end of clip noise artifacts, that noise would be noise BEFORE the $ token. In Meldataset.py we see this line

wave = np.concatenate([np.zeros([5000]), wave, np.zeros([5000])], axis=0)

Once you did your finetuning with no blip sound u should have something like

wave = np.concatenate([np.zeros([5000]), wave_with_no_artifacts, 100_ms_of_zeros, np.zeros([5000])], axis=0)

and your input tokens would look like $,all_the_phonemes_tokens,«,$ . My only question for you is when you crop your output at inference time did you have to to clip just 100ms of silence or 100ms of silence + 5000?

Moonmore · 2023-12-14T02:27:34Z

Moonmore
Dec 14, 2023

I used 4 V100 16GB gpu with a batch size of 8 and max len of 200 with my dataset. OOM to the second stage training.
So I should reduce max len？
Thanks a lot.

3 replies

Kreevoz Dec 14, 2023
Author

8 / 4 = you already use a batch size of 2 per individual GPU, which is the mininum. Smaller and you'll outright crash. You could try reducing the max_len though.

78Alpha Dec 20, 2023

The batch of 2 per and 200 max length takes 19 GB. you'll need to lower that max length by a bit.

Moonmore Dec 21, 2023

The batch of 2 per and 100 max_len takes OOM. I should reduce to the max length?

georgedei · 2023-12-26T05:57:43Z

georgedei
Dec 26, 2023

Is it possible to grow and add new styles to the multi speaker model by repeatedly fine tuning with a small data set? For example, adding a variety of British speakers of different accents with separate fine tuning sessions.
In other words, why is huge data set needed for first training? Why not train in batches of smaller data sets?

1 reply

78Alpha Jan 10, 2024

Like stable diffusion models, repeatedly finetuning will make it forget old information. Kind of like having someone that is 100 remember when they were 5. Might be there but not as clear. Doing the training all at once with different speaker IDs will ensure it can differentiate between them all and not forget things.

junylee11 · 2024-01-15T02:05:08Z

junylee11
Jan 15, 2024

Is it okay if the batch_size and max_len values of 'first_stage' and 'second_stage' are different?

2 replies

martinambrus Jan 15, 2024

From what I understand, it's okay (and usually even neccessary) to lower the batch_size value for stage 2, since second stage requires a lot more VRAM from your GPUs.

What max_len does is it says how much of your audio files is actually processed (i.e. what max length of any audio file). So I think it's best to keep max_len at the same value for both stages, even if it was a bit lower. Otherwise the first stage model would have been basically trained on different data (i.e. on longer WAV file durations) than the second stage, so I'm not sure that the 2nd stage model can utilize first stage model's learning completely (if at all).

But I'd also like to see an answer from someone more experienced with this than myself.

junylee11 Jan 15, 2024

Thank you!

effusiveperiscope · 2024-01-15T07:04:31Z

effusiveperiscope
Jan 15, 2024

Can VRAM usage be lowered by lowering batch_percentage, and is there a compromise in quality?

1 reply

teamblubee Jan 20, 2024

pre-training is fine on 16GB but that will not be enough to train the model. I was able to train with a 4090 mobile with 16GB it took about 2 days to train to 75 epocs.

georgedei · 2024-01-15T08:56:20Z

georgedei
Jan 15, 2024

With a 4090 card we are limited to a max_length of about 280. Does that mean that only the first 3.5 seconds of audio are utilized? And then it wouldn't make any sense to include longer training sentences? 3.5 seconds is a really small context window for text styling. Or perhaps longer samples are split up?

1 reply

teamblubee Jan 20, 2024

a 16GB card can do the pre-training but I couldn't find a configuration that would successfully train with 16GB Gpu.

junylee11 · 2024-01-16T02:27:38Z

junylee11
Jan 16, 2024

How much data is required when fine-tuning a fully learned pre-learning model?
I know the more, the better, but I want a guideline that says it should be more than a few minutes.

1 reply

teamblubee Jan 20, 2024

What use case would you use to fine tune a fully trained model without understanding how much data was required to train the model?

teamblubee · 2024-01-20T16:11:46Z

teamblubee
Jan 20, 2024

This system is very robust at generating a large amount of text. There are a few issues with the pops at the end. My speculation is that when the system tries to generate a voice sample and there is no data, it will generate noise or static.

1 reply

korakoe Jan 30, 2024

I'm pretty sure this is a result of audio data being cut off due to max len... if you have a max_len that is longer than the audio it doesn't have popping artifacts

rikabi89 · 2024-02-23T08:41:59Z

rikabi89
Feb 23, 2024

I was wondering is it worth trying FP8 for training? I struggled to build the wheel on WSL and wonder if it's worth the effort anyway? Are there any benefits?

4 replies

78Alpha Mar 11, 2024

Using FP16 only brought the requirement down by a gigabyte or two. FP8, while it does crash currently, might be able to drop it by another gigabyte or two. Could allow a 3090 or 4090 to get batch size 3 with max-Len 220 or batch size 2 with max_len 300-400.

GUUser91 Jun 1, 2024

@78Alpha
So there's no way to fix this error message?

Traceback (most recent call last):
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/train_finetune_accelerate.py", line 714, in
main()
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/train_finetune_accelerate.py", line 246, in main
model, optimizer, train_dataloader = accelerator.prepare(
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1274, in prepare
if tpu_should_fix_optimizer or (self.mixed_precision == "fp8" and self.fp8_recipe_handler.backend == "TE"):
AttributeError: 'NoneType' object has no attribute 'backend'
Traceback (most recent call last):
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/bin/accelerate", line 8, in
sys.exit(main())
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
simple_launcher(args)
File "/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 688, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/run/media/user/e1745494-af46-4749-9e1a-89d2b2289699/StyleTTS2/venv/bin/python3.10', 'train_finetune_accelerate.py', '--config_path', './Configs/config_ft-ellie.yml']' returned non-zero exit status 1.

78Alpha Jun 1, 2024

Not that I'm capable of figuring out.

GUUser91 Jun 6, 2024

I got fp8 to work, or at least I got it to finish a epoch. This is how I got it to work. Note I'm using a 4090 on artix linux, a variation of arch linux.
I export this to the commandline.

export NVTE_FRAMEWORK=pytorch

I installed the wheel

pip install wheel

I installed this.

sudo pacman -S openmpi

I installed the nigthly cuda 12.4 pytorch builds

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

I install Transformer Engine

pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

I export this to the commandline

export CUDA_LAUNCH_BLOCKING=1

Now you can train with fp8.

accelerate launch --mixed_precision=fp8 train_finetune.py --config_path ./Configs/config_ft.yml

Edit: I'm using cuda-12.4.0-2 that I downloaded from this link
https://archive.org/download/archlinux_pkg_cuda
To fix the /opt/cuda/bin/gcc: No such file or directory error message, click on this link.
https://gitlab.archlinux.org/archlinux/packaging/packages/cuda/-/issues/8

The problem is that the files /opt/cuda/bin/gcc and /opt/cuda/bin/g++ do not exist anymore. I could fix it by linking gcc-13 and g++-13 manually:

ln -s /usr/bin/gcc-13 /opt/cuda/bin/gcc
ln -s /usr/bin/g++-13 /opt/cuda/bin/g++

Be sure to use a xfce4 session so you can save as much vram as possible.

Edit: Maybe it's because I'm using vokan as the base model but with fp8 training I could only set max-len to 286, anymore and I'll get a out of vram error message.

WHalcyon · 2024-02-27T10:25:03Z

WHalcyon
Feb 27, 2024

Thanks for the write-up, I'll post my experiences here.

I tried fine tuning on a single A100 80gb with a much smaller dataset (~30 mins), settings of batch_size 4 and max_len 800. All other settings default except fp16 and multispeaker=false.

Went OOM within a couple epochs. With batch_size 3 it capped at around 73gb vram. Epochs took in the 5-10 minute range but my dataset was small. With style diffusion and adversarial training vram usage increased to 79.5gb. Total 25 epochs, took under 4 hours.

I'm unsure if you can finetune on multiple GPUs so I would be cautious with your settings if you haven't prepared your data yet. If I make another dataset I'll limit my clips to 6-8 seconds each. The VRAM requirements for training are immense, but I have seen people use much longer clips so there may be a variable I missed.

The inference quality is very good, I didn't focus on emotion transfer but it still came out great, excellent vocal quality and prosody. There are some unnatural pauses but some of that may be related to the punctuation in my training set, I included a few too many dashes and the model seemed to associate those with pauses.

The only issue is occasional artifacts at the end of generated clips as others have noted, I assume this is learned from the Libri dataset because I included ample leading and trailing silence and kept all my clips under the max_len of 10s. I am curious if anyone has managed to train this behavior out from the Libri checkpoint. Overall it's not a big deal, I may consider just splicing out the last 100ms of all generations.

2 replies

devidw Feb 27, 2024

I am curious if anyone has managed to train this behavior out from the Libri checkpoint.

A workaround is discussed here: #81 (comment)

WHalcyon Feb 27, 2024

I'll try adding a stop token next time, I thought training on properly segmented samples would help. I've noticed that it doesn't happen (or happens rarely) with generations under 10 seconds so it probably did something.

Latrolage · 2024-03-29T04:59:54Z

Latrolage
Mar 29, 2024

30 emotions/styles tagged as speaker IDs

So if I do the same and tag my data with different styles/emotion dose that mean I'd have to select a specific speaker from the model at inference time and that whole inference will only output that style? If I tag everything under the same speaker, would the model just try to get an average all the styles?

0 replies

GUUser91 · 2024-04-20T10:53:46Z

GUUser91
Apr 20, 2024

Something I noticed while using inference with the base LibriTTS model. If I denoise the reference audio with https://github.com/resemble-ai/resemble-enhance
(Gradio app version, not commandline version), I can get output files that sound closer to the reference audio.
Here's are two examples.
Reference audio
https://vocaroo.com/1lKtWuf0Jmol
Output file.
https://vocaroo.com/13VY3rpyJcuM
Output file after denoising the Reference audio with resemble enhance, I also reduced the reverb of the reference audio by using the Dialogue De-Reverb tool via izotope rx 10.
https://vocaroo.com/11L6YAga7jnO

Reference audio
https://vocaroo.com/1hC3GldKQR9c
Output files.
https://vocaroo.com/13JdgdVaDNOq
https://vocaroo.com/184vTRtsPuqz
Output files after denoising the Reference audio with resemble enhance, I also reduced the reverb of the reference audio by using the Dialogue De-Reverb tool via izotope rx 10.
https://vocaroo.com/1aKPfaXGjGNn
https://vocaroo.com/1f4VoDUB5aH9

8 replies

GUUser91 Apr 27, 2024

@danablend
I found a different base model you might be intereseted in using.
https://huggingface.co/ShoukanLabs/Vokan

Vokan is an advanced finetuned StyleTTS2 model crafted for authentic and expressive zero-shot performance. Designed to serve as a better base model for further finetuning in the future! It leverages a diverse dataset and extensive training to generate high-quality synthesized speech. Trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, Vokan ensures authenticity and naturalness across various accents and contexts. With over 6+ days worth of audio data and 672 diverse and expressive speakers, Vokan captures a wide range of vocal characteristics, contributing to its remarkable performance.

Edit: Here's a model I fine tuned with the Vokan model. This was trained with only 15 minutes. I set max_len to 260. I denoise the audio dataset with resemble enhance (gradio app version), then I used the voice denoise tool with Izotope RX 10.
Reference audio:
https://vocaroo.com/1n5ioH31CoAn
Output:
https://vocaroo.com/1nN1hare6fFH
Syle Diffusion and SLM Adversarial Training Output. This was trained at 75 epochs, it was supposed to go up to 100 epoch but it crashed:
Embedding_scale=1:
https://vocaroo.com/11iuO3GVpNwd
Embedding_scale=2:
https://vocaroo.com/18oKlQDfsBbi

Here's a second model. This was trained with only 4 minutes because the video game character only has 4 minutes of dialogue. Max_Len is set to 280. I used adobe enhance speech on the data audio files, then I used the Accentize dxRevive Pro and Izotope RX 10 Voice Denoise plug-ins via audacity.
Reference audio:
https://vocaroo.com/1i9qC0GE5DkJ
Output:
https://vocaroo.com/1omkkbxCHG2W
This model was trained with style diffusion and SLM Adversarial Training.
Output:
https://vocaroo.com/1bkO70JxRCaC
https://vocaroo.com/1kPkufYi1Kwx

Here's a third example. This was trained with only 3 minutes because the video game character only has 3 minutes of dialogue. Max_Len is set to 280. I used adobe enhance speech on the data audio files, then I used the Accentize dxRevive Pro and Izotope RX 10 Voice Denoise plug-ins via audacity.
Reference Audio:
https://vocaroo.com/1enVslMd1ys7
Output:
https://vocaroo.com/18Pb7eAUjMCA
SLM:
Here's a fourth example. This was trained with only 4 minutes because this restaurant mascot only appeared in six ads. Max_Len is set to 252. slmadv_params min_len is set to 180. slmadv_params max_len is set to 190. I use resemble enhance (Gradio App version) on the data audio files, then I used the Acon Digital DeVerberate 3 plug-in via audacity.
Reference Audio:
https://vocaroo.com/1lUeJhCRUs5t
Output:
https://vocaroo.com/19xltZatlBiw
Fifth example. This was trained with 12 minutes of audio. Max_Len is set to 250. slmadv_params min_len is set to 180. slmadv_params max_len is set to 190. I used resemble enhance (Gradio App version) on the data audio files, then I used the Acon Digital DeVerberate 3 and Izotope RX 10 Voice Denoise plug-ins via audacity.
Output:
https://vocaroo.com/12ZcquEiIkf6
Sixth example. This was trained with 13 minutes of audio. Max_len is set to 250. I used resemble enhance (Gradio App version) on the data audio files, then I used the Acon Digital DeVerberate 3 and Izotope RX 10 Voice Denoise plug-ins via audacity.
Reference Audio:
https://vocaroo.com/1gQa3FzSsruo
https://vocaroo.com/1kx1rVFefBai
Output:
https://vocaroo.com/1nwZdOgH1MC6
https://vocaroo.com/18evuK0CIdWZ
I noticed I can enhance the styletts2 output file with the audacity Macro Manager consisting of the plug ins: Acon Digital DeVerberate 3 (set to remove reverb), dxRevive Pro (set to 30%), and RX 10 Voice De-Noise.
Output:
https://vocaroo.com/1gjsuP9gi6IR
https://vocaroo.com/1kUmnXadYvQ4
This one was trained with both style diffusion and SLM Adversarial Training.
Output:
https://vocaroo.com/12HkeK4gzsSA
https://vocaroo.com/1g7VCGHwbCr4
https://vocaroo.com/160XI6xR5Vm0

Seventh example. This was trained with 7 minutes of audio. Max_Len is set to 260. I used resemble enhance (Gradio App version) on the data audio files, then I used the Acon Digital DeVerberate 3 and Izotope RX 10 Voice Denoise plug-ins via audacity.
Reference Audio:
https://vocaroo.com/1jsskYhCoebB
Output:
https://vocaroo.com/1cSjegQAL7fl
This was trained with Syle Diffusion and SLM Adversarial training. batch_size is set to 2. batch_percentage is set to 1. slmadv_params min_len is set to 180. slmadv_params max_len is set to 190.
Output:
https://vocaroo.com/18WhTdPOvnOB
Eighth example. This was trained with 11 minutes of audio. Max_Len is set to 252. I used resemble enhance (Gradio App version) on the data audio files, then I used the Acon Digital DeVerberate 3 and Izotope RX 10 Voice Denoise plug-ins via audacity. This was trained with Syle Diffusion and SLM Adversarial training. Batch_Size is set to 2. batch_percentage is set to 1. slmadv_params min_len is set to 190. slmadv_params max_len is set to 200.
Reference Audio:
https://vocaroo.com/1aWphdGzOg1r
Embedding_scale=1 output:
https://vocaroo.com/1jtDdz1KDU8m
Embedding_scale=2 output:
https://vocaroo.com/17iFlpeKvp5Q
Reference Audio:
https://vocaroo.com/1nAW8J7QCA99
Output:
https://vocaroo.com/1go5CohnF7hT

Nineth example. This was trained with 21 minutes of audio. Batch_Size is set to 2. batch_percentage is set to 1. Max_Len is set to 252. slmadv_params min_len is set to 180. slmadv_params max_len is set to 190. I used resemble enhance (Gradio App version) on the data audio files, then I used the Acon Digital DeVerberate 3.
Output:
https://vocaroo.com/1iZgi9jm0FKc

Tenth example. This was trained with 7 minutes of audio. Batch_Size is set to 2. batch_percentage is set to 1. Max_Len is set to 252. slmadv_params min_len is set to 180. slmadv_params max_len is set to 190. I used resemble enhance (Gradio App version) on the data audio files, then I used the Acon Digital DeVerberate 3.
Output:
https://vocaroo.com/1cfPpkmuKOsj

danthegoodman1 Jun 5, 2024

@GUUser91 how long did these training runs take and on what hardware?

GUUser91 Jun 5, 2024

@danthegoodman1
30 to 150 minutes on 4090.

danthegoodman1 Jun 5, 2024

Thanks! Do you happen to have the config you used as well still?

GUUser91 Jun 5, 2024

@danthegoodman1
No.

DogeLord081 · 2024-04-21T03:28:48Z

DogeLord081
Apr 21, 2024

Does anyone know why SLM adverserial training doesn't start when finetuning as shown in this issue? #227 I think it may have to do with the batch size and max_length, as I and the OP of this issue both had a batch size of 2 with max_length 400, but I haven't done any further testing on this yet. Here's my full config:

log_dir: "/content/Models/MITSUHA"
save_freq: 5
log_interval: 10
device: "cuda"
epochs: 100 # number of finetuning epoch (1 hour of data)
batch_size: 2
max_len: 400 # maximum number of frames
pretrained_model: "/content/Models/LibriTTS/epochs_2nd_00020.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "/content/StyleTTS2/Utils/JDC/bst.t7"
ASR_config: "/content/StyleTTS2/Utils/ASR/config.yml"
ASR_path: "/content/StyleTTS2/Utils/ASR/epoch_00080.pth"
PLBERT_dir: '/content/StyleTTS2/Utils/PLBERT/'

data_params:
  train_data: "/content/StyleTTS2/Data/train_list.txt"
  val_data: "/content/StyleTTS2/Data/val_list.txt"
  root_path: "/content/StyleTTS2/Data/wavs"
  OOD_data: "/content/StyleTTS2/Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64 
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size
  
  dropout: 0.2

  # config for decoder
  decoder: 
      type: 'hifigan' # either hifigan or istftnet
      resblock_kernel_sizes: [3,7,11]
      upsample_rates :  [10,5,3,2]
      upsample_initial_channel: 512
      resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
      upsample_kernel_sizes: [20,10,6,4]
      
  # speech language model config
  slm:
      model: 'microsoft/wavlm-base-plus'
      sr: 16000 # sampling rate of SLM
      hidden: 768 # hidden size of SLM
      nlayers: 13 # number of layers of SLM
      initial_channel: 64 # initial channels of SLM discriminator head
  
  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0
  
loss_params:
    lambda_mel: 5. # mel reconstruction loss
    lambda_gen: 1. # generator loss
    lambda_slm: 1. # slm feature matching loss
    
    lambda_mono: 1. # monotonic alignment loss (TMA)
    lambda_s2s: 1. # sequence-to-sequence loss (TMA)

    lambda_F0: 1. # F0 reconstruction loss
    lambda_norm: 1. # norm reconstruction loss
    lambda_dur: 1. # duration loss
    lambda_ce: 20. # duration predictor probability output CE loss
    lambda_sty: 1. # style reconstruction loss
    lambda_diff: 1. # score matching loss
    
    diff_epoch: 10 # style diffusion starting epoch
    joint_epoch: 30 # joint training starting epoch

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.0001 # learning rate for acoustic modules
  
slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling

9 replies

GUUser91 May 13, 2024

I figured out how to start SLM adverserial training. Just follow these links.
#227 (comment)
#227 (comment)
SLM loss stats won't appear during training. But it will appear in the tensorboard folder.

godspirit00 May 14, 2024

@GUUser91 Thanks for the information! Did you compare the quality of the generated speech? Does it sound better after the SLM training?

GUUser91 May 16, 2024

@godspirit00
I couldn't find any difference the first time I did this. I believe I did something wrong. So I deleted the model. So I tinkered around with the config_ft.yml file. I set Max_Len to 120. I set batch_percentage to 1. I set slmadv_params min_len to 100 and slmadv_params max_len to 120. Batch size is set to 2. Now the DiscLM and GenLM Loss stats are no longer at 0. I'm using a rx 7900 xtx. Note I'm training a model with style diffusion in one fine-tuning session and adversarial training in another session.

Here a pic from my tensorboard folder.

Edit: I discovered I can do style diffusion and SLM adversarial training together in one session. I set max_len to 280, epoch set to 100, batch_size set to 2, batch_percentage set to 1, slmadv_params min_len set to 180, slmadv_params max_len set to 200, diff_epoch to 10, joint_epoch to 50, I'm using the vokan model as the base model. This was trained with only 2 minutes because this character only has 2 minutes of dialogue (Princess Zelda from OMOCAT web comic).
Style diffusion only:
https://vocaroo.com/1auQKT89gg23
https://vocaroo.com/1jy8VcWUZFiv
Style diffusion and SLM adversarial training:
https://vocaroo.com/17ZhWhIIihNe
https://vocaroo.com/1m3tZkC14aUd
Here's another model I tried making, Goob from Meet the Robinson saying Zelda OMOCAT lines. I changed slmadv_params min_len set to 200 and slmadv_params max_len to 250. The rest is the same as the previous settings.
Reference Audio:
https://vocaroo.com/1kRLEnhRoA4u
Style diffusion only:
https://vocaroo.com/1a8tB0PxW4Ot
Style diffusion and SLM adversarial training:
https://vocaroo.com/1mLA9V3Yii9T

Second Edit:
I rented out a h100 from runpod again. I edit the config_ft.yml file with micro. I set batch_size to 4 and max_len to 500. I set slmadv_params min_len to 100 and slmadv_params max_len to 500 and batch_percentage to 1. And now SLM adversarial training has started to work.

Here's a screenshot of the vram usage.

This is what I did in runpod.
I update the repo.

apt update

I install these.

apt install aria2 p7zip-full curl jq micro

I use the pwd command to find directory / filepath infomation.

pwd

I put the training dataset in a zip file and then I upload it to either https://catbox.moe/ or https://litterbox.catbox.moe/ (Which lets you upload a 1GB file)

I download the vokan base model and zip file with aria2.

aria2c -x 16 -s 16 -k 1M https://archive.org/download/epoch_2nd_00012/epoch_2nd_00012.pth

aria2c -x 16 -s 16 -k 1M https://files.catbox.moe/XXXXXXXX.zip

I unzip the file

7z x XXXXXXXX.zip

I download the gofile upload script file.

aria2c https://raw.githubusercontent.com/Sushrut1101/GoFile-Upload/master/upload.sh

I give the script permissions

chmod +rx upload.sh

I upload the pth file to https://gofile.io/

./upload.sh model.pth

godspirit00 May 16, 2024

@GUUser91 Thanks again for the information! That's very helpful

GUUser91 Jun 6, 2024

I tinkered around with config_ft.yml file and I trained with these settings on my 4090. I installed the latest version of bitsandbytes. Batch_size is set to 2. batch_percentage is set to 1. This was trained with 7 minutes of audio. The prompt is: Huh. Maybe she got off work later than I thought.

Max_Len is set to 252. slmadv_params min_len is set to 180. slmadv_params max_len is set to 190.
https://vocaroo.com/1naWqrrK3Via
https://vocaroo.com/1gJfNFoAkErL

Max_Len is set to 252. slmadv_params min_len is set to 252. slmadv_params max_len is set to 252.
https://vocaroo.com/1mc7AQst5t4u
https://vocaroo.com/1hdoRs1MaLQG

Max_Len is set to 260. slmadv_params min_len is set to 260. slmadv_params max_len is set to 260.
https://vocaroo.com/1dcr7vhTDXUP
https://vocaroo.com/18S9JyR905XH

Max_Len is set to 280. slmadv_params min_len is set to 280. slmadv_params max_len is set to 280.
https://vocaroo.com/1fQaJFghUi01
https://vocaroo.com/1fiq3ubdGEYv

georgedei · 2024-04-24T23:56:31Z

georgedei
Apr 24, 2024

tracing back further, just a quick look:
s_preds -> s_trg traces to train_finetune line 321 "Denoiser training"
have to insert some diagnostic print out to see where it fails

1 reply

DogeLord081 Apr 26, 2024

Have you found anything further yet?

GUUser91 · 2024-05-18T18:24:36Z

GUUser91
May 18, 2024

I don't understand max_len number when converted to seconds. Like how much is max_len 190 when converted to seconds and milliseconds.

2 replies

78Alpha May 18, 2024

2.375 seconds. Take your max length and multiple by 0.0125.

GUUser91 May 18, 2024

Thanks! 👍

junylee11 · 2024-07-01T14:57:04Z

junylee11
Jul 1, 2024

Hello, we created our own character model through English model Fine Tuning.
The data used for learning is an emotionless voice file.
I recorded one file each containing 'happiness', 'angry' and 'sad' feelings, and I put it in compute_style() to proceed with style transfer.
However, emotions (style) are not reflected well in the results.
What's the reason?

What should I do for emotional tts?

(alpha = 0.1, beta = 0.5, diffusion=10) same with demo

0 replies

DevOps920719 · 2024-07-05T16:08:23Z

DevOps920719
Jul 5, 2024

@Kreevoz could you please guide me how to train styletts2 model in python?
Thank you.

0 replies

auspicious3000 · 2024-07-15T05:09:32Z

auspicious3000
Jul 15, 2024

@Kreevoz Thanks for the detailed notes. I have a question regarding the following comment:

The LibriTTS dataset has poor punctuation and a mismatch of spoken/unspoken pauses with the transcripts.

Since the model in the paper was trained using LibriTTS as-is with excellent results, does this mean that this issue might not significantly affect the audio quality? I'd appreciate your insights on this.

1 reply

martinambrus Sep 2, 2024

I'd say that the audio quality might be good but you'd likely experience weird pauses where they shouldn't be (like after a hyphen /-/) or a muted dollar sign ($), since either that character or its text representation (dollar) is miss-trained as silence.

jaaari · 2024-11-03T10:25:30Z

jaaari
Nov 3, 2024

Thanks also for all the info!

We are actually looking for someone to dockerize this whole finetuning process for a project of ours, so if anyone who had some successful finetunes is interested in work with us on this for a small compensation, please email me at [email protected] :)
And we would of course keep the result open source, to make finetuning this model more accessible.

0 replies

Notes on Finetuning #81

Teaching the model new features

Text dataset quality

Robustness

Artifacts

Finetune training Stages

Base

Style Diffusion

SLM Adversarial Training

Errors, Crashes

Mixed-precision Training

VRAM Usage

VRAM Usage Strategies

Checkpoints

Resuming Finetuning

Logging

Operating system support

Hardware Requirements

Quality comparisons

Replies: 22 comments · 79 replies

yl4579 Nov 25, 2023 Maintainer

Kreevoz Nov 25, 2023 Author

yl4579 Nov 25, 2023 Maintainer

Kreevoz Nov 25, 2023 Author

Kreevoz Nov 26, 2023 Author

yl4579 Nov 26, 2023 Maintainer

Kreevoz Nov 26, 2023 Author

Kreevoz Nov 26, 2023 Author

yl4579 Nov 30, 2023 Maintainer

Kreevoz Dec 2, 2023 Author

Kreevoz Dec 21, 2023 Author

Kreevoz Dec 14, 2023 Author

Replies: 22 comments 79 replies

yl4579
Nov 25, 2023
Maintainer

Kreevoz Nov 25, 2023
Author

yl4579 Nov 25, 2023
Maintainer

Kreevoz Nov 25, 2023
Author

Kreevoz Nov 26, 2023
Author

yl4579 Nov 26, 2023
Maintainer

Kreevoz Nov 26, 2023
Author

Kreevoz Nov 26, 2023
Author

yl4579 Nov 30, 2023
Maintainer

Kreevoz Dec 2, 2023
Author

Kreevoz Dec 21, 2023
Author

Kreevoz Dec 14, 2023
Author