
Ask about Phoneme Segmentation and Phoneme Duration #12

Closed
toannhu opened this issue Jan 30, 2018 · 9 comments


toannhu commented Jan 30, 2018

Hi, @r9y9. First of all, thank you for such a brilliant implementation of WaveNet. I'm now studying how to detect phoneme durations (the start and end time of each phoneme) extracted from audio and how to align them with linguistic features, but I don't know how to do this. Can you share the idea you use to solve this problem, and point me to the code in this repo that does this job? Thanks!


P.S.: Btw, is it possible to train this repo on another language? I'm currently working on Vietnamese with my own dataset (7 hours of audio and ARPABET linguistic features extracted from text).


r9y9 commented Jan 30, 2018

The repository focuses on the WaveNet vocoder, as the name says. It doesn't provide phoneme duration estimation or linguistic feature extraction, which are needed to replicate the original WaveNet-based TTS. The vocoder can take an arbitrary type of input, though, as long as the time resolution is adjusted to match the audio.
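To illustrate the time-resolution adjustment, here is a minimal numpy sketch (not the repository's own code, which may instead use learned upsampling layers): frame-level conditional features are repeated `hop_size` times so that one feature vector lines up with each audio sample.

```python
import numpy as np

def upsample_conditional_features(features: np.ndarray, hop_size: int) -> np.ndarray:
    """Repeat each frame-level feature vector hop_size times so the
    conditional feature sequence matches the audio sample resolution.

    features: (num_frames, num_channels) frame-level features (e.g. mel frames)
    returns:  (num_frames * hop_size, num_channels)
    """
    return np.repeat(features, hop_size, axis=0)

# 4 frames of 2-channel features, 3 audio samples per frame
frames = np.arange(8, dtype=np.float32).reshape(4, 2)
upsampled = upsample_conditional_features(frames, hop_size=3)
print(upsampled.shape)  # (12, 2)
```

Nearest-neighbor repetition like this is the simplest scheme; smoother interpolation or transposed convolutions are common alternatives.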

Linguistic feature extraction (a.k.a. the text processing frontend) is the hard part of TTS, and it often requires deep knowledge of the target language. The WaveNet vocoder itself is language-independent, but you will have to implement a text processing frontend if you want to condition the model on linguistic features.
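As an illustration of what frame-level linguistic features could look like, here is a hypothetical sketch: a forced-alignment result, given as (phoneme, duration) pairs, is expanded into one-hot frame-level features. The phoneme inventory, frame shift, and function name are invented for illustration; a real frontend (e.g. HTS-style full-context labels) produces much richer features.

```python
import numpy as np

# Hypothetical phoneme inventory; a real frontend would use the full
# phone set of the target language plus contextual features.
PHONEMES = ["sil", "a", "b", "c"]

def durations_to_frame_features(alignment, frame_shift_ms=5.0):
    """Expand (phoneme, duration_ms) pairs into frame-level one-hot features."""
    rows = []
    for phoneme, duration_ms in alignment:
        one_hot = np.zeros(len(PHONEMES), dtype=np.float32)
        one_hot[PHONEMES.index(phoneme)] = 1.0
        num_frames = int(round(duration_ms / frame_shift_ms))
        rows.extend([one_hot] * num_frames)
    return np.stack(rows)

# 10 ms of silence, then 25 ms of "a", at a 5 ms frame shift
feats = durations_to_frame_features([("sil", 10.0), ("a", 25.0)])
print(feats.shape)  # (7, 4): 2 + 5 frames, 4 phoneme classes
```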

@imdatceleste

@toannhu, you might be interested in Aeneas if you are only looking for phoneme detection. It is not designed for phonemes, but by adjusting various parameters it might help you understand how to do what you want.

@r9y9 r9y9 added the question label Feb 3, 2018

toannhu commented Feb 6, 2018

@r9y9 @imdatsolak Thanks for the support. I found the Montreal Forced Aligner tool, which helps me with this problem. As far as I can see in this repo, @r9y9 uses another library, nnmnkwii, for the frontend work. Please excuse my ignorance, but can you explain to me what input (after the frontend processing) is fed to the WaveNet vocoder for local conditioning? It would be very helpful to know the basic idea of how the WaveNet vocoder works. I'm really confused when reading this repo's code. Once again, thank you!


r9y9 commented Feb 6, 2018

There's no text processing frontend used in this repository. nnmnkwii does have functionality to extract linguistic features from HTS-style context labels, though. In this repository, nnmnkwii is mostly used for preprocessing, e.g., mulaw or inv_mulaw: https://r9y9.github.io/nnmnkwii/latest/references/preprocessing.html
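For reference, mulaw and inv_mulaw implement the standard mu-law companding transform. Here is a minimal numpy sketch written from the standard formulas (not copied from nnmnkwii): the signal is assumed to be in [-1, 1].

```python
import numpy as np

def mulaw(x: np.ndarray, mu: int = 255) -> np.ndarray:
    """Mu-law companding of a signal in [-1, 1] (ITU-T G.711 style)."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def inv_mulaw(y: np.ndarray, mu: int = 255) -> np.ndarray:
    """Inverse mu-law companding."""
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

# Round trip recovers the original signal
x = np.linspace(-1, 1, 11)
assert np.allclose(inv_mulaw(mulaw(x)), x, atol=1e-6)
```

In practice the companded value is additionally quantized to 256 discrete levels before being fed to a categorical-output WaveNet.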

The WaveNet class in the repository doesn't assume any particular domain for the conditional features, but the training / preprocessing scripts are written assuming a mel-spectrogram is used as the conditional feature.


toannhu commented Feb 6, 2018

@r9y9 Thanks for enlightening me. I finally got the key idea. One more question: is it possible to use this WaveNet vocoder repo with Tacotron? Do you plan to do this in the future? Any ideas or suggestions?


r9y9 commented Feb 6, 2018

Definitely it's possible. A Tacotron2-like WaveNet vocoder is WIP at r9y9/deepvoice3_pytorch#21.

See also #1 (comment).


r9y9 commented May 16, 2018

Tacotron + WaveNet is done.

@r9y9 r9y9 closed this as completed May 16, 2018

toannhu commented May 22, 2018

@r9y9 Thanks. I succeeded in training Rayhane-mamah's Tacotron 2 repo on my own corpus. Going to try generating with your WaveNet vocoder next. I'm eager to hear the result. This is very exciting!


toannhu commented May 27, 2018

@r9y9 I tried to integrate Rayhane-mamah's Tacotron 2 with the WaveNet vocoder using your Google Colab code, but it failed: the pitch has been lost. Btw, I use batch_size = 16 and r = 2 in Tacotron 2 and batch_size = 1 in your repo; everything else is default. The WaveNet repo was trained on the original audio, not on GTA (ground-truth-aligned) features. Here are the results.
sound.zip
