Ask about Phoneme Segmentation and Phoneme Duration #12
The repository focuses on the WaveNet vocoder, as the name says. It does not provide phoneme duration estimation or linguistic feature extraction, which are needed to replicate the original WaveNet-based TTS. The vocoder can take any type of conditioning input, though, provided its time resolution is adjusted to match the audio. Linguistic feature extraction (a.k.a. the text processing frontend) is the hard part of TTS, and often requires deep knowledge of the target language. The WaveNet vocoder itself is language independent, but you will have to implement a text processing frontend if you want to condition the model on linguistic features.
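The "time resolution is adjusted" point above means that per-frame conditioning features (e.g. mel spectrogram frames) must be stretched to one vector per audio sample before they can condition the sample-level WaveNet. A minimal sketch of the simplest scheme, nearest-neighbor repetition by the hop size (an illustration, not this repository's actual upsampling code, which uses learned transposed convolutions in some configurations):

```python
import numpy as np

def upsample_conditioning(features, hop_size):
    """Repeat each frame hop_size times along the time axis so the
    conditioning sequence has one vector per audio sample."""
    return np.repeat(features, hop_size, axis=0)

# 100 mel frames x 80 bins, hop size 256 samples -> 25600 x 80
mel = np.zeros((100, 80))
c = upsample_conditioning(mel, 256)
print(c.shape)  # → (25600, 80)
```

Any upsampling that maps T frames to T × hop_size vectors works here; repetition is just the cheapest choice.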
@r9y9 @imdatsolak Thanks for the support. I have found the Montreal Forced Aligner tool, which helps me with this problem. As I can see in this repo, @r9y9 uses another library, nnmnkwii, for the frontend things. Please excuse my ignorance, but can you explain what input (after frontend processing) is fed to the WaveNet vocoder as local conditioning? It would be very helpful to know the basic ideas of how the WaveNet vocoder works. I'm really confused when reading this repo's code. Once again, thank you!
There's no text processing frontend used in the repository. nnmnkwii has functionality to extract linguistic features from HTS-style context labels, though. In this repository nnmnkwii is mostly used for preprocessing, e.g., mulaw or inv_mulaw. https://r9y9.github.io/nnmnkwii/latest/references/preprocessing.html
@r9y9 Thanks for enlightening me. I finally got the key idea. One more question: is it possible to use this WaveNet vocoder repo with Tacotron? Do you plan to do this in the future? Any ideas or suggestions?
It's definitely possible. A Tacotron2-like WaveNet vocoder is WIP at r9y9/deepvoice3_pytorch#21. See also #1 (comment).
Tacotron + WaveNet was done
@r9y9 Thanks. I succeeded in training Rayhane-mamah's Tacotron 2 repo on my own corpus. Going to try generating with your WaveNet vocoder. I'm eager to hear the result. This is very exciting!
@r9y9 I tried to integrate Rayhane-mamah's Tacotron 2 with the WaveNet vocoder using your Google Colab code, but it failed: the pitch was lost. Btw, I use
Hi, @r9y9. First of all, thank you for such a brilliant implementation of WaveNet. I am now studying how to detect phoneme durations (start and end time of each phoneme) from audio and align them with linguistic features, but I don't know how to do this. Can you explain the idea you use to solve this problem, and point me to the code in this repo that does this job? Thanks!
P.S.: Btw, is it possible to train this repo on another language? Currently, I'm working with Vietnamese using my own dataset (7 hours of audio and ARPABET linguistic features extracted from text).