[Feature Request]: Package streaming End-to-End STT to TTS #218

Closed
Katehuuh opened this issue May 11, 2024 · 4 comments

Comments

@Katehuuh

I’ve seen the streaming TTS PR. Like the simple STT-to-TTS loop available in SillyTavern, it doesn't require any action from the user.
I thought you could add my whisper script (a fast STT I’ve made) alongside the TTS (alltalk_tts), either combined with or as an optional addition to my ooba extension's fast STT script, to make a packaged streaming end-to-end STT-to-TTS pipeline, so that the user can answer naturally without pressing Record/Enter as in the default whisper_stt extension.

While it works fine with the auto Enter-key workaround, I did not find a way to Generate via JS in Gradio streaming.

@Katehuuh Katehuuh changed the title Package streaming End-to-End STT to TTS [Feature Request]: Package streaming End-to-End STT to TTS May 11, 2024
@erew123
Owner

erew123 commented May 12, 2024

Hi @Katehuuh

I have been considering adding whisper into AllTalk in a couple of ways, so this could quite well fit into that :)

So let me just ask a couple of questions on this:

  1. If I am understanding correctly, this would be an always-on microphone scenario (or we could make it a checkbox for "keep the microphone on when this checkbox is selected"), and you can just naturally interact via speech, with it auto-submitting the STT generation back into text-gen-webui. Have I got that correct as a loose understanding?

  2. I see you have tested on Windows, so I would need to test on Linux? And if I can find someone who has a Mac, I can get them to test.

  3. As we aren't using the streaming TTS just yet (waiting to see if it gets approved), we may have to figure out how this all interacts. I'm not yet sure how easy it is to stop/cancel the streaming TTS generation from text-gen-webui. In AllTalk v2 I am building in a way of stopping TTS generation (if the text has already been sent to AllTalk), but I have no idea how text-gen-webui can be sent a "stop sending the text over for TTS".

Where this all gets complicated is multi-threading requests within Python and access to the GPU cores. Meaning that if the LLM is controlling all the tensor cores of a GPU, it may not be happy also trying to generate TTS on those cores at the same time. I'll have to think on this and look at it when we can play with the streaming generation. I guess I'm more just putting this number 3 in here for my own reference/thoughts when I get to look at this again.
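One way to picture the GPU contention concern above is to serialize the two workloads behind a single lock, so LLM and TTS generation never run on the GPU at the same moment. This is only a minimal sketch under assumptions: `gpu_lock`, `run_on_gpu`, and `run_concurrently` are hypothetical names, and `time.sleep` stands in for the real generation calls. It is not how AllTalk or text-gen-webui schedule GPU work.

```python
import threading
import time

# Hypothetical shared lock: whichever task holds it has exclusive GPU use.
gpu_lock = threading.Lock()


def run_on_gpu(task_name, work_seconds, log):
    """Stand-in for an LLM or TTS generation call; only one runs at a time."""
    with gpu_lock:
        log.append((task_name, "start"))
        time.sleep(work_seconds)  # placeholder for the real GPU work
        log.append((task_name, "end"))


def run_concurrently():
    """Start an 'llm' and a 'tts' task together and return the event log."""
    log = []
    threads = [
        threading.Thread(target=run_on_gpu, args=("llm", 0.05, log)),
        threading.Thread(target=run_on_gpu, args=("tts", 0.05, log)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    # Each task's "start" is immediately followed by its own "end".
    for entry in run_concurrently():
        print(entry)
```

The trade-off is latency: TTS waits until the LLM releases the lock, so this avoids contention but does not give true parallel generation.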

  4. Would you be able to explain this a little bit more for me: "While it works fine with the auto Enter-key workaround, I did not find a way to Generate via JS in Gradio streaming." I'm assuming you are saying that this function:
def generate_transcribe():
    keyboard.send("enter")

was the only way to commit the generated STT to the chat? I know that text-gen-webui has recently moved from Gradio 3.5.2 to 4.28 (I think), so maybe there are some better options within that version. What would be the benefit of going to "Generate JS in Gradio streaming"?

Sorry for the questions, I'm just trying to get this fixed into my head! And thanks for offering your code! :)

@Katehuuh
Author

I’ve modified the default whisper_stt extension to create ooba-insanely-fast-whisper.
I suggest you do the same: start from the simple default whisper_stt in alltalk_tts for its simplicity, since what I have is only a workaround.

  1. "If I am understanding correctly, this would be an always-on microphone scenario (or we could make it a checkbox for "keep the microphone on when this checkbox is selected"), and you can just naturally interact via speech, with it auto-submitting the STT generation back into text-gen-webui. Have I got that correct as a loose understanding?"
     Mostly. The "always on microphone" part is built into Gradio (see "Real Time Speech Recognition"): change
    audio = gr.Audio(source="microphone") to audio = gr.Audio(source="microphone", streaming=True).
The «loop» then repeats several steps:
  • Using the Silero VAD speech_prob to detect "silence" when there is only background noise or the voice from the alltalk_tts TTS
  • Using insanely-fast-whisper for STT, for speed (with flash_attn_2)
  • Combining chunks, since a sentence may be cut off at the end of any chunk that is not the last one; if it is the last chunk, then Generate.
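The loop described in these steps could be sketched roughly as follows. This is an illustration under assumptions, not the code from the extension: `split_utterances`, `speech_threshold`, and `max_silent_chunks` are hypothetical names, and each `(text, speech_prob)` pair stands in for what an STT model and a VAD model would return for one streamed audio chunk. Chunks are accumulated while speech is detected, and a combined utterance is emitted (the point at which Generate would be triggered) once enough consecutive silent chunks are seen.

```python
from typing import Iterable, Iterator, List, Tuple


def split_utterances(
    chunks: Iterable[Tuple[str, float]],
    speech_threshold: float = 0.5,
    max_silent_chunks: int = 2,
) -> Iterator[str]:
    """Group streamed chunks into utterances, ending one after enough silence.

    Each item is (partial_text, speech_prob); both are placeholders for the
    per-chunk output of an STT model and a VAD model.
    """
    pending: List[str] = []
    silent_run = 0
    for text, speech_prob in chunks:
        if speech_prob >= speech_threshold:
            # Speech detected: keep the (possibly cut-off) partial text.
            pending.append(text)
            silent_run = 0
        elif pending:
            # Silence after speech: count it, and close the utterance
            # once the pause is long enough.
            silent_run += 1
            if silent_run >= max_silent_chunks:
                yield " ".join(pending)
                pending, silent_run = [], 0
    if pending:
        # The stream ended mid-utterance: flush what was collected.
        yield " ".join(pending)
```

For example, feeding it `[("hello", 0.9), ("there", 0.8), ("", 0.1), ("", 0.05), ("how are", 0.95), ("", 0.2), ("you", 0.9)]` yields two utterances, because the single silent chunk between "how are" and "you" is shorter than `max_silent_chunks`.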
  2. "I see you have tested on Windows, so I would need to test Linux? and if I can find someone who has a mac, I can get them to test."
     All modules are cross-platform and should work on Linux/Mac.
  4. "Would you just be able to explain this a little bit more for me"
     Yes. The default whisper_stt uses JS to click on Generate with Gradio:
    audio.stop_recording(
        auto_transcribe, [audio, auto_submit, whipser_model, whipser_language], [shared.gradio['textbox'], audio]).then(
        None, auto_submit, None, js="(check) => {if (check) { document.getElementById('Generate').click() }}")

When using streaming=True, I couldn’t use auto_submit:

        None, auto_submit, None, _js="(False) => { console.log('Check:', check); if (check) { document.getElementById('Generate').click(); }}");

So I simply disabled the check (False) and opted for the keyboard module instead. That workaround does not work over a share=True shared URL link on other devices like phones, or if you just click away from the chat field.

@erew123
Owner

erew123 commented May 15, 2024

Hi @Katehuuh

Thanks for the reply. What I'm going to do is put a link to this in the Feature Requests. I'm so deep into working on v2 of AllTalk that I think it's something I will try to put in there, as I'm hoping to have a beta out soon.

Feature requests

I may well get back to you if I get stuck somewhere along the lines.

Thanks

@erew123 erew123 closed this as completed May 15, 2024
@Katehuuh
Author

I've noticed HF attempting the same: https://github.com/huggingface/speech-to-speech. Though doing it manually will give more control.
