Diffusion sampler / Bert #169

georgedei · 2023-12-21T06:23:40Z


When comparing the code of the original distro with yl4579's, I see that the inference method has changed: a diffusion sampler and the BERT model have been added. I have been puzzling over this. Could you please shed some light on why they were added and what they achieve? Much appreciated; these subjects are difficult to learn about.

Looking closer, I can see that the sampler function produces a tensor, which is then split into two parts, s and ref, but ref seems to be unused. Would it be possible to get some comments on that part of the code?

yl4579 · 2023-12-22T06:29:53Z

Maintainer

The BERT model is validated in the PL-BERT paper and repo, so you can refer to https://github.com/yl4579/PL-BERT for more details.

The use of the diffusion model is akin to Stable Diffusion (a latent diffusion model), except that the latent variable is a style vector. You can think of StyleTTS as an autoencoder: the style encoder encodes the speech into a latent space (the style), and the speech is then reconstructed from that latent style. The diffusion model turns StyleTTS into a probabilistic generative model that samples the style directly.

The style has two parts, acoustic (ref) and prosodic (s), which encode different aspects of the style (a longer time scale for prosodic styles, a shorter one for acoustic styles). You can read the paper for more details.
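
As a rough illustration of the two-part style, the sketch below splits one sampled vector into its acoustic and prosodic halves. It assumes the layout that appears in the inference snippet later in this thread (first 128 entries acoustic, the rest prosodic) and a total size of 256; the helper name `split_style` and the constant `ACOUSTIC_DIM` are made up for this example and are not part of the repo.

```python
# Hypothetical helper, not part of StyleTTS 2: splits one style vector
# into the two halves described above.
ACOUSTIC_DIM = 128  # assumption, taken from the slicing in the inference snippet


def split_style(style):
    """Return (ref, s): acoustic part first, prosodic part second."""
    ref = style[:ACOUSTIC_DIM]  # acoustic style (shorter time scale)
    s = style[ACOUSTIC_DIM:]    # prosodic style (longer time scale)
    return ref, s


# Toy vector standing in for a sampled style; 256 entries is an assumption.
style = [float(i) for i in range(256)]
ref, s = split_style(style)
print(len(ref), len(s))
```

In the real code the same split is done with tensor slicing on the sampler output, so this plain-list version only shows the indexing.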

ref should be used. Could you be more specific about the part where you think it's not used?


georgedei Dec 23, 2023
Author

This is part of the inference code. Looking more carefully, I see that the ref tensor is used in model.decoder(), and the s tensor is used in model.predictor.text_encoder(). Would you mind saying a few words about these two functions? I did read the PL-BERT paper and have a better grasp now. Thanks for finding the time to help out a newbie. I'm trying to get my feet on the ground.

    # sample a style vector, conditioned on the BERT text embedding
    s_pred = sampler(noise,
                     embedding=bert_dur[0].unsqueeze(0), num_steps=diffusion_steps,
                     embedding_scale=embedding_scale).squeeze(0)

    if s_prev is not None:
        # convex combination of previous and current style
        s_pred = alpha * s_prev + (1 - alpha) * s_pred

    s = s_pred[:, 128:]    # prosodic style
    ref = s_pred[:, :128]  # acoustic style

    # predict a duration for each input token from the prosodic style
    d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

    x, _ = model.predictor.lstm(d)
    duration = model.predictor.duration_proj(x)
    duration = torch.sigmoid(duration).sum(axis=-1)
    pred_dur = torch.round(duration.squeeze()).clamp(min=1)

    # build a 0/1 alignment matrix: row i is 1 on the frames of token i
    pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
    c_frame = 0
    for i in range(pred_aln_trg.size(0)):
        pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
        c_frame += int(pred_dur[i].data)

    # encode prosody
    en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
    F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
    # decode audio from frame-aligned text features, F0, energy and acoustic style
    out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0).to(device)),
                        F0_pred, N_pred, ref.squeeze().unsqueeze(0))

    return out.squeeze().cpu().numpy(), s_pred
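
The duration loop in the snippet above can be hard to picture, so here is a small plain-Python sketch of the same idea: each token gets a row of ones over the frames it occupies, and multiplying per-token features by that matrix repeats each token's feature for its duration. The function names `build_alignment` and `expand` and the toy numbers are invented for this illustration; the real code does the same thing with torch tensors and `@`.

```python
def build_alignment(durations):
    """Return a tokens x frames 0/1 matrix; row i is 1 on token i's frames."""
    total_frames = sum(durations)
    alignment = [[0] * total_frames for _ in durations]
    frame = 0
    for i, dur in enumerate(durations):
        for t in range(frame, frame + dur):
            alignment[i][t] = 1
        frame += dur
    return alignment


def expand(features, alignment):
    """Compute features (channels x tokens) @ alignment (tokens x frames)."""
    n_tokens = len(alignment)
    n_frames = len(alignment[0])
    return [
        [sum(row[i] * alignment[i][t] for i in range(n_tokens))
         for t in range(n_frames)]
        for row in features
    ]


durations = [2, 1, 3]          # toy predicted durations for three tokens
alignment = build_alignment(durations)
features = [[10, 20, 30]]      # one channel, one value per token
print(expand(features, alignment))  # [[10, 10, 20, 30, 30, 30]]
```

Read this way, `d.transpose(-1, -2) @ pred_aln_trg` and `t_en @ pred_aln_trg` both stretch token-level features to frame level before they reach the F0/energy predictor and the decoder.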
