
Support voice synthesis to Vec<u8> #30

Draft: wants to merge 6 commits into master
Conversation

@Bear-03 (Contributor) commented Jul 22, 2022:

This aims to solve #12.

@Bear-03 (Contributor, Author) commented Jul 22, 2022:

I've only implemented WinRT for now, but I'll look into how to implement it for the other backends.

@ndarilek (Owner) commented:
Neat, thanks! More backends would be great--what often happens is folks do one and I end up having to do the rest. :) You won't be able to do tolk, but if you can at least cover web, I'll look into the others.

Also, I wonder if we should use Vec<u8> or some other, slightly smarter container for audio? I'd like to be sure there's a known output format for whatever audio data we get, and I'm concerned that each synth might have its own concept of what format to use for synthesized audio. So we might end up with a situation where different platforms output different formats and the crate is no longer cross-platform.

Thanks again.

@Bear-03 (Contributor, Author) commented Jul 22, 2022:

> I'd like to be sure there's a known output format for whatever audio data we get

For my current project (where I'm going to be using tts-rs) I use PCM, specifically signed 16-bit PCM. Signed 16-bit, unsigned 16-bit, and 32-bit floating point are the three sample formats that cpal supports, and since it's a popular crate, I'd assume supporting those would be more than enough.

I remember reading that the bytes WinRT returns are already PCM, but I'm not too sure; we could do some research. Of course, keeping the library cross-platform is a priority.
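
For reference, this is roughly what consuming 16-bit PCM from a `Vec<u8>` looks like on the caller's side (just a sketch; the actual sample format and byte order would depend on the backend):

```rust
// Sketch: reading a Vec<u8> of signed 16-bit PCM as samples.
// Assumes little-endian byte order, which would need to be confirmed per backend.
fn pcm16_samples(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}
```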

@ndarilek (Owner) commented Jul 22, 2022 via email

@Bear-03 (Contributor, Author) commented Jul 22, 2022:

Yes, now that you mention it, that's true: you're often required to provide a lot of parameters to play audio or save it to a file. I'm pretty sure those are constant for a given backend, so it would be a matter of creating something like a Spec struct that holds that data for each backend.
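
Something along these lines (all names here are placeholders, nothing that exists in tts-rs yet):

```rust
/// Hypothetical audio spec a backend could expose; fields are illustrative.
pub struct Spec {
    pub sample_rate: u32,      // e.g. 22_050 or 48_000 Hz
    pub channels: u16,         // 1 = mono, 2 = stereo
    pub format: SampleFormat,  // sample encoding of the returned bytes
}

/// The three formats mentioned above.
pub enum SampleFormat {
    I16,
    U16,
    F32,
}
```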

@ndarilek (Owner) commented:
Does cpal not have some sort of audio container with all this data that we can return directly? I'm a bit hesitant to have the audio parameters be a separate thing you need access to--I'd rather the return value include everything necessary, if possible.

@Bear-03 (Contributor, Author) commented Jul 22, 2022:

cpal uses SupportedStreamConfig, which holds the stream configuration for an input/output device.

Returning the audio metadata every time you synthesize would be wasteful, in my opinion, as you'd be using resources for something that isn't really needed. The audio metadata won't change during runtime, so generating it once and letting the developer store it is far more efficient.
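
For example, a backend's fixed spec could be turned into a cpal config once and reused for playback (a sketch against the cpal 0.13-era API; the constructor signature and values are assumptions that should be double-checked):

```rust
use cpal::{SampleFormat, SampleRate, SupportedBufferSize, SupportedStreamConfig};

// Build the playback config once from the backend's (hypothetical) fixed spec,
// so synthesize() itself only needs to return the raw bytes.
fn stream_config() -> SupportedStreamConfig {
    SupportedStreamConfig::new(
        1,                            // channels: mono
        SampleRate(22_050),           // sample rate in Hz (illustrative value)
        SupportedBufferSize::Unknown, // buffer size not known up front
        SampleFormat::I16,            // signed 16-bit PCM
    )
}
```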

@ndarilek (Owner) commented Jul 22, 2022 via email

@Bear-03 (Contributor, Author) commented Jul 22, 2022:

> if there's some way we can autogenerate it once and cache it, that might be useful

And then return it in the container?

> I'm a bit concerned about these formats changing, and of having to maintain/sync hard-coded structs

I'm fairly sure it's impossible to retrieve the audio metadata from the audio bytes themselves, as you need the metadata first in order to interpret them.

AFAIK WinRT doesn't have any way to get the audio spec (I'll have a look), so the only alternative is hard-coding it. My idea is to have something analogous to min_rate(), normal_rate(), and max_rate(): a method that returns the Spec for each backend.
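
Roughly this shape, reusing the Spec idea from above (the method and type names are hypothetical, and the WinRT values are placeholders until the actual output format is confirmed):

```rust
/// Hypothetical spec type, as sketched earlier.
pub struct Spec {
    pub sample_rate: u32,
    pub channels: u16,
}

/// Each backend would implement this alongside min_rate()/normal_rate()/max_rate().
pub trait Backend {
    fn synthesize(&self, text: &str) -> Vec<u8>;
    fn audio_spec(&self) -> Spec;
}

struct WinRtBackend;

impl Backend for WinRtBackend {
    fn synthesize(&self, _text: &str) -> Vec<u8> {
        unimplemented!("call into the WinRT synthesizer here")
    }

    // Placeholder values; assumes WinRT emits 16-bit mono PCM, which still needs verifying.
    fn audio_spec(&self) -> Spec {
        Spec {
            sample_rate: 22_050,
            channels: 1,
        }
    }
}
```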
