New to the tts field and i have some questions about the necessary data #352

JRMeyer · 2021-03-07T09:18:49Z

JRMeyer
Mar 7, 2021
Maintainer

>>> Plato
[February 8, 2021, 12:43pm]

first: Sorry if this forum is not meant for people to ask general
questions regarding the speech synthesis/TTS field. This is basically
the only active forum on TTS i found and where i can ask some (basic)
questions.

So yeah, basically I'm relatively new to the TTS field and i have a
questions about the data. Most datasets seem too be existing out of
either a single speaker, or multiple speakers and then clearly annotated
which speaker said which sentence. slash
why is this? will a model underprerform drastically when I would just
train using a dataset with multiple speakers and not using a one-hot
encoding vector as amazon did.

I already played around with mozilla deepspeech (STT) before and here
the model was able to get text from the MFCC no matter what speaker said
it. But for TTS it seems that the other way around seems to be harder to
generalize for multiple speakers, why is this? Or how could i fix this
problem when i don't know which reader said which sentence?

Thank you in advance !

ps: i would also appreciate if poeple post papers or relevant links to
this problem or just general introductions to TTS since i do not seem to
succes into finding a very in depth documentation about TTS to lecture
myself (all i can fins are papers of models, but none about the whole
process and the important details to keep in mind )

[This is an archived TTS discussion thread from discourse.mozilla.org/t/new-to-the-tts-field-and-i-have-some-questions-about-the-necessary-data]

JRMeyer · 2021-03-07T09:18:52Z

JRMeyer
Mar 7, 2021
Maintainer Author

>>> nmstoker
[February 9, 2021, 2:41pm]

Hi [ slash Plato](
this is a perfectly reasonable thing to ask about. Whilst it's general,
this kind of discussion seems useful for people using the TTS repo.

First to address your PS: slash
If you want some relevant papers on TTS, then there's a great repo
listing several key ones here: slash

{.site-icon

### erogol/TTS-papers

Collection of Text to Speech papers. Contribute to erogol/TTS-papers
development by creating an account on GitHub.

Those are academic papers - several of them will include links to
samples and often GitHub repos for the authors own code. In a few
notable cases the models covered have been implemented in the TTS repo.

If you want something more beginner friendly then googling is best -
often there are university lectures on YouTube for topics like this that
can be a good start. I don't have any to hand now but I'll see what i
can dig out.

Here's something that gives a bit of an introduction over time, but if
you're motivated mainly by getting up to speed then I'm guessing you may
not need to go into too much depth on the older methods (ie slash < deep
learning approaches) but it's handy to know they exist and often
resources related to them can still be useful. slash

20](https://medium.com/sciforce/text-to-speech-synthesis-an-overview-641c18fcd35f '04:05PM - 13 February 2020')

### Text-to-Speech Synthesis: an Overview

We discuss what TTS is, how its quality is measured. We compare
different approaches to TTS: concatenative & parametric TTS, Deep...

[Reading time: 8 min read.label1}

slash
The Wikipedia entry is also pretty comprehensive slash

en.m.wikipedia.org

### Speech synthesis

Speech synthesis is the artificial production of human speech. A
computer system used for this purpose is called a speech computer or
speech synthesizer, and can be implemented in software or hardware
products. A text-to-speech (TTS) system converts normal language text
into speech; other systems render symbolic linguistic representations
like phonetic transcriptions into speech. Synthesized speech can be
created by concatenating pieces of recorded speech that are stored in a
database. Systems d slash ...

Now to your main question. I can give a partial answer but you may
benefit from comments from those more involved in multi speaker usage.

Firstly, this is something of an empirical subject - with so many
factors, often doing gives the best insights. slash
My suspicion is that if you have multiple voices but they are not
identified to the model it will struggle enormously to create a decent
voice - if you look at the waveform for different speakers they can be
dramatically different. So the model will be simultaneously be
attempting to get good for samples from one speaker and another and it
won't know which is right. You as a human who understands speech may be
overlooking just how different different speakers are because you've
learned to understand them.

This comes up to an extent even when training with a single voice. If
the samples from the speaker are too varied (ie differing in style too
much, and arbitrarily) then the voice quality struggles. slash
One example where this happens is with the EK1 dataset trained model
(search in the forum for that if interested). In that dataset, derived
from LibriVox novels, the speaker often reads characters in this
corresponding accent. I suspect this impairs the quality somewhat and
it's only thanks to the substantial amount of training data in that set
(32+ hrs) that overcomes it.

Not sure if this is practical but depending on how many speakers you
think you have, you may be able to use speaker diarisation techniques to
identify them and label your dataset from the clusters. One of the
notebooks in the repo has code to make a start on that slash
https://github.com/mozilla/TTS/blob/master/notebooks/PlotUmapLibriTTS.ipynb
(although I'm having trouble getting it to display currently)

I hope that's a start at least

[Archived Post]

0 replies

JRMeyer · 2021-03-07T09:18:54Z

JRMeyer
Mar 7, 2021
Maintainer Author

>>> Plato
[February 12, 2021, 4:29pm]

thank you for the
extensive reply! It really helped me to understand it a bit further and
the extra resources are great!

When looking a bit more around into vocoders i ended up with the same
question. Must a vocoder be trained on 1 person, or can it be trained on
multiple speakers without using a speaker embedding / one hot vector
encoding. slash
For example: can i just feed it MFCC, audio pairs and expect a good
result, or should i feed it MFCC,Audio, speaker triplets ? Because from
most papers that i've read my feelings tells me that just MFCC, audio
pairs should be enough. But if this is the case, how does the output
voice sound like? Can it be different dependent on the MFCC or will it
be the same voice in the wav everytime?

[Archived Post]

0 replies

JRMeyer · 2021-03-07T09:18:57Z

JRMeyer
Mar 7, 2021
Maintainer Author

>>> nmstoker
[February 12, 2021, 5:40pm]

So a vocoder can be a single voice or multiple ones (to create a
universal vocoder). I've only trained them in the single voice but there
is a universal vocoder available via the latest version of the TTS repo

For the single voice case you just give it the audio and it creates the
spectrograms itself. I believe that it's the same with multi voice but
it's worth looking at the code/config files in case I've missed
something for multiple voice usage

[Archived Post]

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New to the tts field and i have some questions about the necessary data #352

{{title}}

Replies: 3 comments

{{title}}

{{title}}

{{title}}

Select a reply

New to the tts field and i have some questions about the necessary data #352

JRMeyer Mar 7, 2021 Maintainer

Replies: 3 comments

JRMeyer Mar 7, 2021 Maintainer Author

JRMeyer Mar 7, 2021 Maintainer Author

JRMeyer Mar 7, 2021 Maintainer Author

JRMeyer
Mar 7, 2021
Maintainer

JRMeyer
Mar 7, 2021
Maintainer Author

JRMeyer
Mar 7, 2021
Maintainer Author

JRMeyer
Mar 7, 2021
Maintainer Author