Word capitalization, everything lowercase #57

Open
ZashIn opened this issue Nov 22, 2023 · 5 comments

Comments

@ZashIn

ZashIn commented Nov 22, 2023

Text generated by Sayboard is all lowercase, except for the first word after punctuation.

Not sure if this is an issue with the models or the app, but it makes the app hard to use beyond casual chats, especially for languages that capitalize nouns, such as German.

@unoukujou

Agreed.

Especially words like: I, I'm, I'll, I'd.

I don't think there's any situation where those should be lowercase. Please improve this as it'll make a great difference.

@ElishaAz
Owner

We can have a list of all the words and whether they should be capitalized. Does anyone know of such a (multilingual) list?

@ZashIn
Author

ZashIn commented Nov 29, 2023

This is probably more an issue of the (smaller) models: alphacep/vosk-api#1204

  • A postprocessing network (punctuation models) should solve this, but the existing models are probably too big for mobile use?
  • E.g. for German, the largest model, vosk-model-de-tuda-0.6-900k, seems to respect capital letters

Maybe a combination of a small base vosk model with a (reduced) punctuation model would work?

Word lists

> We can have a list of all the words and whether they should be capitalized. Does anyone know of such a (multilingual) list?

Such a word list would need to include the context or consist of generated patterns, since in some languages the capitalization cannot be determined just by the word form itself.

E.g. in German, all nouns are capitalized, including nominalized verbs:
verb: schreiben (to write)
noun: [das] Schreiben, ...

So most word lists with nouns etc. would probably result in a lot of incorrect capitalization, since the verb form in such cases is more common.
It might still be possible to generate such a pattern list to improve the output, but I doubt that it is worth it (linguistic coverage, complexity, performance), compared to an optimized postprocessing network.
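A minimal sketch of the ambiguity described above, written in Python for brevity and not taken from the app. The noun list and the function name `capitalize_by_word_list` are hypothetical; the point is only that a lookup by word form cannot tell the verb from the nominalized noun.

```python
# Hypothetical noun list. "schreiben" is included because "das Schreiben"
# is a noun, even though the same word form is usually a verb.
GERMAN_NOUNS = {"brief", "schreiben"}


def capitalize_by_word_list(text, nouns):
    """Capitalize every word whose lowercase form appears in `nouns`.

    This looks only at the word form and ignores context.
    """
    words = []
    for word in text.split():
        if word.lower() in nouns:
            word = word[:1].upper() + word[1:]
        words.append(word)
    return " ".join(words)


# Noun usage: the result is correct.
print(capitalize_by_word_list("das schreiben ist angekommen", GERMAN_NOUNS))
# Verb usage: "schreiben" is capitalized even though it should stay lowercase.
print(capitalize_by_word_list("wir schreiben einen brief", GERMAN_NOUNS))
```

The second call yields "wir Schreiben einen Brief", which is the kind of incorrect capitalization a plain word list would introduce.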

@unoukujou

As far as English goes:

Months ... (January → December)
I ... (I, I'm, I'd, I'll)
Start of sentences ... (the first word, and then any word after a period)
Names ... (you could get a list of all countries, states, cities, and common personal names; it won't be perfect, but it can probably take care of 90% of what we write)

The hard part is titles of books, movies, and the like, but we can't expect everything. Taking care of just the most common cases listed above would help tremendously. Right now Sayboard outputs one long, all-lowercase sentence. It doesn't look good, and I have to spend 15 minutes afterwards correcting everything.
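A rough sketch of the English rules listed above (months, "I" and its contractions, sentence starts), written in Python for brevity and not taken from the app. The word sets and the function name `capitalize_english` are illustrative assumptions. "may" and "march" are left out of the month list on purpose because they are also common ordinary words.

```python
# Illustrative word sets; these are assumptions, not lists shipped with the app.
MONTHS = {
    "january", "february", "april", "june", "july", "august",
    "september", "october", "november", "december",
}
PRONOUN_FORMS = {"i", "i'm", "i'd", "i'll"}
SENTENCE_END = (".", "!", "?")


def capitalize_english(text):
    """Apply a few simple capitalization rules to lowercase English text."""
    result = []
    start_of_sentence = True
    for word in text.split():
        # Compare without trailing punctuation, e.g. "december." -> "december".
        core = word.rstrip(".,!?;:").lower()
        if core in PRONOUN_FORMS:
            word = "I" + word[1:]
        elif start_of_sentence or core in MONTHS:
            word = word[:1].upper() + word[1:]
        result.append(word)
        # The next word starts a sentence if this one ends with . ! or ?
        start_of_sentence = word.endswith(SENTENCE_END)
    return " ".join(result)


print(capitalize_english("i think i'll visit in december. the weather is nice"))
```

Names of countries, cities, and people could be handled the same way with one more set, at the cost of false positives for words that are also ordinary vocabulary.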

Perhaps also add a user-defined replacement list, where users can add their own words to auto-replace. That way everyone can tune the app to their own needs.

For example, I could add:
three → 3 (always replace word three with 3)
monique → Monique (maybe Monique is a name that I say a lot but it never gets capitalized, I can add it to the list myself)

So with such a user defined list, we can fine tune the app to our personal needs.
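A minimal sketch of such a replacement list, written in Python for brevity and not taken from the app. The dictionary `USER_REPLACEMENTS` and the function `apply_replacements` are hypothetical names; the entries mirror the two examples above. Matching is case-insensitive and limited to whole words.

```python
import re

# Hypothetical user-defined list, using the two examples from this comment.
USER_REPLACEMENTS = {"three": "3", "monique": "Monique"}


def apply_replacements(text, replacements):
    """Replace whole words in `text` according to `replacements`."""
    if not replacements:
        return text
    lookup = {key.lower(): value for key, value in replacements.items()}
    # Try longer keys first so multi-word entries win over their prefixes.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


print(apply_replacements("i met monique three times", USER_REPLACEMENTS))
```

Because of the word-boundary anchors, an entry like "three" does not touch longer words that merely contain it.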

But definitely have built-in lists of common names, and "I".

@dktzde

dktzde commented Mar 25, 2024

I use Sayboard mostly in German. For me it would be a great help if all nouns were capitalized.

I have found the following possible word lists:
(Source: https://german.stackexchange.com/questions/25114/suche-eine-umfassende-datenbank-aller-deutschen-w%C3%B6rter)

I understand the restrictions of #57 (comment) but capitalising all nouns would help me a lot. I could still edit nominalisation and similar special cases by hand.

This issue is also related to #58, which would help as well.


4 participants