
VoiceOver-Style Language Switching #21

Open
TTWNO opened this issue Sep 5, 2022 · 6 comments
@TTWNO (Member) commented Sep 5, 2022

VoiceOver allows you to switch languages mid-string as well as keep settings unique to each language's voice.
For example, if I read the following sentence in VoiceOver: "In Chinese, 你好 means hello!"

VoiceOver will automatically switch between voices and settings to read the entire sentence in one smooth pass, and it manages this even on the modest CPU of an iPhone SE. Side note: it's not quite that simple; the block of foreign text seems to need to be a bit longer before VoiceOver switches voices, but there is definitely a threshold, and it can switch mid-sentence.

Odilia should eventually support this feature. Without dedicated voices set up through speech-dispatcher you may need to fall back to espeak, but it would still let you read and write multi-language documents without having to switch languages back and forth manually.

Language identification, unless I'm completely wrong, is likely a relatively expensive process, so it should always be optional and exposed as a setting the user can change.

I believe this should be possible via the SSIP protocol with speech-dispatcher. I haven't looked deeply enough to figure this out myself, but if it can't be done that way, I'm not sure how else it could be. More research required.
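To make the idea concrete, here is a rough sketch of what the SSIP exchange might look like for the example sentence above, setting the language before each utterance. Only the client-side commands are shown, the server replies are omitted, and how Odilia would actually segment the text into language runs is left open:

```
SET self LANGUAGE en
SPEAK
In Chinese,
.
SET self LANGUAGE zh
SPEAK
你好
.
SET self LANGUAGE en
SPEAK
means hello!
.
```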

@TTWNO added the labels enhancement (new feature or request), help wanted (extra attention is needed), and TTS (improvements to the text-to-speech subsystem) on Sep 5, 2022
@albertotirla (Member)

There are a few ways to deal with this; two come to mind right now:

  • The most straightforward one is to rely on text attributes, especially on the web. Correct me if I'm wrong, since I don't have much of a web background, but every HTML page can be marked as being in a specific language with the lang attribute, for example lang="en-US". In that case, wouldn't it be logical for specific paragraphs, or pieces of text within a paragraph, to be annotated with language tags in the same way? As an aside, I think that's how Wikipedia does it.
  • We can use something like lingua-rs, which doesn't rely on HTML; instead it actually detects the language used in the text (see the sketch after this list). This requires more processing power and would have to sit behind a user-configurable flag, and it isn't meant to stay enabled for long because of its far-from-small memory and CPU consumption. I believe it uses machine learning or something close to it, so the resource hunger is expected.
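For reference, here's a minimal sketch of what using lingua-rs might look like, assuming we add the lingua crate as a dependency and restrict detection to a handful of languages to keep the loaded models small; the language set and sample text are only examples:

```rust
use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};

fn main() {
    // Restricting the detector to a known set of languages keeps the
    // loaded models (and therefore memory usage) smaller.
    let detector: LanguageDetector = LanguageDetectorBuilder::from_languages(&[
        Language::English,
        Language::German,
        Language::Ukrainian,
    ])
    .build();

    // Returns None when the detector is not confident about any language.
    let detected: Option<Language> =
        detector.detect_language_of("ласкаво просимо до оділії");
    println!("{detected:?}"); // expected: Some(Ukrainian)
}
```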

Also, what you saw in VoiceOver is voice specific, not VoiceOver specific; some of that can be done with espeak as well. What's happening there is that the voice itself notices a transition from a Latin to a non-Latin alphabet, so it does its own language selection when that text is handed to it. I can't write Chinese or Japanese examples because espeak doesn't support those, so I'll do something similar with Ukrainian. DeepL Translate says "ласкаво просимо до оділії скринрідера!" means "welcome to the Odilia screen reader!" If you read that with espeak, you will hear it change voice and language and spell it out as well as it can, given your locale, codepage and so on. Even though that's not VoiceOver, NVDA or Orca specific, we could potentially make it Odilia specific, as long as the speech-dispatcher module currently in use supports the detected language; the detection can be wrong sometimes, but it's better than nothing.
Also, we have the problem that speech-dispatcher doesn't let us change language mid-sentence. However, we could do language processing before feeding the text to speech-dispatcher: in that processing phase we insert speech markers wherever the language changes, provided we can determine that accurately, and when the callback fires with a marker-reached event we know to change the language. We could track which language to switch to with some kind of mapping from text position, marker name, or whatever that marker event contains, to a language. Yes, this may delay speech quite a bit, I'm not sure, but it's a plan of action if nothing better comes to mind by the time that feature gets implemented.
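A very rough sketch of that marker-to-language mapping, assuming language detection has already split the text into single-language runs; the "odilia-lang-N" marker naming scheme and the commented-out client calls are hypothetical placeholders for whatever our speech-dispatcher wrapper ends up exposing:

```rust
use std::collections::HashMap;

/// One run of text that language detection decided is in a single language.
struct LanguageRun {
    lang: String, // e.g. "en", "uk"
    text: String,
}

/// Build a marked-up utterance plus a marker-name -> language map.
fn build_marked_text(runs: &[LanguageRun]) -> (String, HashMap<String, String>) {
    let mut marked = String::new();
    let mut marker_to_lang = HashMap::new();

    for (i, run) in runs.iter().enumerate() {
        let marker = format!("odilia-lang-{i}");
        // SSML-style index mark; speech-dispatcher can report these back
        // as "index mark reached" events while the text is being spoken.
        marked.push_str(&format!("<mark name=\"{marker}\"/>{}", run.text));
        marker_to_lang.insert(marker, run.lang.clone());
    }
    (marked, marker_to_lang)
}

// Later, in the callback that receives speech events (pseudocode):
// if let Event::IndexMark(name) = event {
//     if let Some(lang) = marker_to_lang.get(&name) {
//         // hypothetical client call: switch the synthesis language here
//         // client.set_language(lang)?;
//     }
// }

fn main() {
    let runs = vec![
        LanguageRun { lang: "en".into(), text: "In Chinese, ".into() },
        LanguageRun { lang: "zh".into(), text: "你好 ".into() },
        LanguageRun { lang: "en".into(), text: "means hello!".into() },
    ];
    let (marked, map) = build_marked_text(&runs);
    println!("{marked}\n{map:?}");
}
```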

@mcb2003 (Contributor) commented Sep 9, 2022

  • The most straightforward one is to rely on text attributes, especially on the web. Correct me if I'm wrong, since I don't have much of a web background, but every HTML page can be marked as being in a specific language with the lang attribute, for example lang="en-US". In that case, wouldn't it be logical for specific paragraphs, or pieces of text within a paragraph, to be annotated with language tags in the same way? As an aside, I think that's how Wikipedia does it.

From my understanding this is correct, yes.

  • We can use something like lingua-rs, which doesn't rely on HTML; instead it actually detects the language used in the text. This requires more processing power and would have to sit behind a user-configurable flag, and it isn't meant to stay enabled for long because of its far-from-small memory and CPU consumption. I believe it uses machine learning or something close to it, so the resource hunger is expected.

Will look into this more, but cool.

Also, we have the problem that speech-dispatcher doesn't let us change language mid-sentence. However, we could do language processing before feeding the text to speech-dispatcher: in that processing phase we insert speech markers wherever the language changes, provided we can determine that accurately, and when the callback fires with a marker-reached event we know to change the language. We could track which language to switch to with some kind of mapping from text position, marker name, or whatever that marker event contains, to a language. Yes, this may delay speech quite a bit, I'm not sure, but it's a plan of action if nothing better comes to mind by the time that feature gets implemented.

A much simpler solution would be to use SSIP blocks.
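If I'm reading the SSIP documentation correctly, that would mean wrapping the per-language messages in a BLOCK BEGIN / BLOCK END pair so speech-dispatcher delivers them as one unit; whether SET self LANGUAGE is accepted inside a block is something I'd still want to verify (client commands only, server replies omitted):

```
BLOCK BEGIN
SET self LANGUAGE en
SPEAK
In Chinese,
.
SET self LANGUAGE zh
SPEAK
你好
.
BLOCK END
```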

@TheQuinbox

Unicode character ranges can also be used for most languages with non-Latin alphabets, for what it's worth. Might also be worth looking into how NVDA on Windows does this.
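As an illustration, a range check along those lines might look like the sketch below, using a few Unicode blocks (CJK Unified Ideographs, Hiragana, Katakana, Cyrillic); mapping a script hit to a concrete language would still need extra logic:

```rust
/// Guess the script of a single character from its Unicode block.
/// This identifies scripts, not languages: Cyrillic could be Russian,
/// Ukrainian, Bulgarian, and so on.
fn script_of(c: char) -> Option<&'static str> {
    match c {
        '\u{4E00}'..='\u{9FFF}' => Some("CJK Unified Ideographs"),
        '\u{3040}'..='\u{309F}' => Some("Hiragana"),
        '\u{30A0}'..='\u{30FF}' => Some("Katakana"),
        '\u{0400}'..='\u{04FF}' => Some("Cyrillic"),
        _ => None, // Latin, punctuation, digits, everything else
    }
}

fn main() {
    for c in "hello 你好 ласкаво".chars() {
        println!("{c}: {:?}", script_of(c));
    }
}
```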

@albertotirla (Member)

Unicode character ranges can also be used for most languages with non-Latin alphabets, for what it's worth. Might also be worth looking into how NVDA on Windows does this.

NVDA doesn't do a very good job of it either, as far as I know; I'm not speaking from a coding/implementation viewpoint here, rather from a user's one. Most of the language processing in NVDA is handled either by the synthesizer currently speaking or by NVDA itself, but as far as I know NVDA only switches language when UIA (or whatever) changes the language attribute of the currently read text to something other than the language of the current voice, for example when a paragraph is annotated with a language attribute. As for character ranges, that's probably one of the tricks lingua-rs uses as well, but on its own it doesn't guarantee any reliability whatsoever. For example, just try distinguishing, with that method alone, German text from an English translation. We know that German has ü, ö, ä and ß, but once we've identified those, what do we do? Consider the whole lexical unit German, or try to identify, with a German dictionary, the smallest part of that unit that's German and speak only that? What can even be considered a lexical unit, how would we do this, do we build a synthesizer-level engine and shove that into Odilia? Or maybe I'm misunderstanding what you mean, in which case please post back with an example or a broader explanation, since all of this will be taken into account when we get to that feature set and have to revisit this in order to implement it.

@TheQuinbox

Look at www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml for a list of different languages and their Unicode character ranges. It wouldn't work for all languages, though.

@albertotirla (Member)

I wanted to reply to your comment via email, but I guess GitHub doesn't want me to do that, so I'll have to post in this field again.
Thanks for that link, it will be very useful, even though I personally don't get much out of it since it's not an actual HTML table and it kind of confuses me. Yes, I see what you mean now; however, those character ranges are pretty much all non-Latin alphabets, i.e. Hiragana and Katakana, so that method won't help us separate, say, English from German, and a synthesizer with that capability can already recognise such languages on its own.

@TTWNO added this to the 1.0 milestone on Oct 3, 2022