Tokenizer Internationalization - French #3
Comments
For words such as "j'ajoute", I would like "ajoute" to be a word (a keyword, actually), but it doesn't work. I think the French tokenizer is pretty similar to the English one except for this apostrophe (elision) rule, which has exceptions such as "aujourd'hui".
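To make the rule above concrete, here is a minimal sketch of an elision-aware split. It is not code from this project: `tokenize_fr`, `ELIDED_PREFIXES`, and `ELISION_EXCEPTIONS` are hypothetical names, and both word lists are illustrative rather than complete.

```python
import re

# Hypothetical, non-exhaustive list of words whose apostrophe belongs to the word.
ELISION_EXCEPTIONS = {"aujourd'hui", "quelqu'un", "presqu'île"}

# Hypothetical, non-exhaustive list of elided forms that precede an apostrophe.
ELIDED_PREFIXES = {"j", "l", "d", "m", "t", "s", "n", "c", "qu"}

# A word is a run of letters, optionally joined to further runs by an apostrophe.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize_fr(text):
    """Split lowercased text into words, separating elided prefixes like "j'"."""
    tokens = []
    for match in WORD_RE.finditer(text.lower()):
        # Treat the typographic apostrophe like the ASCII one.
        word = match.group(0).replace("’", "'")
        if word in ELISION_EXCEPTIONS or "'" not in word:
            tokens.append(word)
            continue
        prefix, rest = word.split("'", 1)
        if prefix in ELIDED_PREFIXES:
            # "j'ajoute" becomes "j'" and "ajoute", so "ajoute" can match a keyword.
            tokens.extend([prefix + "'", rest])
        else:
            tokens.append(word)
    return tokens


print(tokenize_fr("J'ajoute du lait aujourd'hui"))
```

With this approach the exceptions are handled by a lookup before splitting, so "aujourd'hui" stays a single token while "ajoute" becomes available on its own.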
I know this is really old, but I'm curious whether this fits into the "normalize()" approach I've implemented in English. Essentially I do a pre-pass on the text that does things like converting "it's" to "it is", which simplifies parsing. Does it make sense to do a French normalize() preprocessor that converts things like "j'amie" to "je amie"? This would live in: https://github.com/MycroftAI/mycroft-core/blob/dev/mycroft/util/lang/parse_fr.py#L1027
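For readers unfamiliar with the idea, a pre-pass of this kind could look roughly like the sketch below. It is an illustration only, not the mycroft-core implementation: `normalize_en` and the `CONTRACTIONS` table are hypothetical, and the table covers just a handful of forms.

```python
# Hypothetical, non-exhaustive table of English contractions and their expansions.
CONTRACTIONS = {
    "it's": "it is",
    "i'm": "i am",
    "don't": "do not",
    "what's": "what is",
}


def normalize_en(text):
    """Lowercase the text and expand known contractions word by word."""
    words = text.lower().split()
    return " ".join(CONTRACTIONS.get(word, word) for word in words)


print(normalize_en("It's what I'm looking for"))
```

Because the expansion is a plain dictionary lookup, it cannot tell "it's" meaning "it is" from "it's" meaning "it has"; a real normalizer would need more context.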
Hello,
While "it's" and "it is" are both valid in English,
sadly "je aime" is not valid in French.
(And by the way, "amie" is not a verb; it means "friend".)
(In reply to Steve Penrod's comment above, posted Thu, 15 Mar 2018 at 18:07.)
C'est la vie! There is a reason I shouldn't be the one implementing the French parsers. :)
We should test whether the EnglishTokenizer implementation is sufficient for French and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.