>>> Plato
[February 8, 2021, 12:43pm]
First: sorry if this forum is not meant for people to ask general
questions about the speech synthesis/TTS field. It is basically the
only active forum on TTS I have found where I can ask some (basic)
questions.
So yeah, basically I'm relatively new to the TTS field and I have a
question about the data. Most datasets seem to consist of either a
single speaker, or multiple speakers with a clear annotation of which
speaker said which sentence.
Why is this? Will a model underperform drastically if I just train on a
dataset with multiple speakers without using a one-hot speaker encoding
vector, as Amazon did?
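
To make concrete what I mean by conditioning on speaker identity, here is a minimal sketch of how I understand multi-speaker models to work (this is not Mozilla TTS code; all module names and sizes are made up): a learned embedding, looked up by speaker ID, is broadcast over time and concatenated with the text-encoder outputs so the decoder knows which voice to produce.

```python
import torch
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    def __init__(self, vocab_size=60, text_dim=256, n_speakers=10, spk_dim=64):
        super().__init__()
        self.text_embedding = nn.Embedding(vocab_size, text_dim)
        self.encoder = nn.LSTM(text_dim, text_dim, batch_first=True)
        # one row per annotated speaker -- this is why datasets label who said what
        self.speaker_embedding = nn.Embedding(n_speakers, spk_dim)

    def forward(self, phoneme_ids, speaker_ids):
        # encode the phoneme/character sequence: (batch, time, text_dim)
        text, _ = self.encoder(self.text_embedding(phoneme_ids))
        # look up the speaker vector and broadcast it over every time step
        spk = self.speaker_embedding(speaker_ids)            # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, text.size(1), -1)  # (batch, time, spk_dim)
        # a decoder would then work from this speaker-conditioned representation
        return torch.cat([text, spk], dim=-1)


model = MultiSpeakerEncoder()
phonemes = torch.randint(0, 60, (2, 15))  # a batch of 2 symbol sequences
speakers = torch.tensor([3, 7])           # speaker ID of each utterance
print(model(phonemes, speakers).shape)    # torch.Size([2, 15, 320])
```

I am assuming the one-hot vector Amazon used plays the same role as this embedding lookup, just without the learned projection.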
I already played around with Mozilla DeepSpeech (STT) before, and there
the model was able to recover the text from the MFCCs no matter which
speaker said it. But the other way around, TTS, seems to be harder to
generalize over multiple speakers. Why is this? Or how could I fix this
problem when I don't know which reader said which sentence?
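
One workaround I have been considering for the missing-label case (a sketch, assuming a pretrained speaker-embedding extractor such as Resemblyzer's VoiceEncoder plus scikit-learn for clustering, both of which would need to be installed): extract one speaker embedding per utterance, cluster the embeddings, and use the cluster index as a pseudo speaker ID.

```python
from pathlib import Path

from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer
from sklearn.cluster import KMeans


def pseudo_label_speakers(wav_dir, n_speakers):
    """Assign a pseudo speaker ID to every wav file by clustering d-vectors."""
    wav_paths = sorted(Path(wav_dir).glob("*.wav"))
    encoder = VoiceEncoder()
    # one fixed-size embedding per utterance that captures voice identity
    embeds = [encoder.embed_utterance(preprocess_wav(p)) for p in wav_paths]
    # utterances from the same voice should fall into the same cluster
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeds)
    return {p.name: int(label) for p, label in zip(wav_paths, labels)}


# usage: speaker_map = pseudo_label_speakers("wavs/", n_speakers=10)
```

The catch is that I would have to guess the number of speakers, which is part of why I am asking whether the annotation is really necessary.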
Thank you in advance!
PS: I would also appreciate it if people posted papers or relevant links
about this problem, or just general introductions to TTS, since I can't
seem to find any in-depth documentation about TTS to educate myself with
(all I can find are papers on individual models, but none about the whole
process and the important details to keep in mind).
[This is an archived TTS discussion thread from discourse.mozilla.org/t/new-to-the-tts-field-and-i-have-some-questions-about-the-necessary-data]