Tokenizer Internationalization - French #3

clusterfudge · 2016-01-08T17:12:58Z

We should test whether the EnglishTokenizer implementation is sufficient for French and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.

gcrieloue-main · 2016-03-11T23:04:59Z

For words such as "j'ajoute", I would like "ajoute" to be tokenized as its own word (a keyword, actually), but that doesn't work.

I think a French tokenizer would be pretty similar to the English one, except for this apostrophe rule (which has exceptions, such as "aujourd'hui").
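
To make the apostrophe rule above concrete, here is a minimal sketch of how elided forms could be split while keeping exceptions whole. It is not the EnglishTokenizer code and not an existing API; `tokenize_fr` and `KEEP_WHOLE` are hypothetical names, and the exception list is illustrative and far from complete.

```python
# Hypothetical exception list: words whose apostrophe is part of the word.
# Illustrative only; a real list would need more entries.
KEEP_WHOLE = {"aujourd'hui"}


def tokenize_fr(text):
    """Split whitespace-separated words, breaking elisions such as "j'ajoute"."""
    tokens = []
    # Treat the typographic apostrophe the same as the ASCII one.
    for raw in text.replace("\u2019", "'").split():
        word = raw.strip('.,;:!?"()')
        if not word:
            continue
        head, apostrophe, tail = word.partition("'")
        if apostrophe and head and tail and word.lower() not in KEEP_WHOLE:
            # "j'ajoute" becomes "j'" and "ajoute", so "ajoute" can match a keyword.
            tokens.extend([head + "'", tail])
        else:
            tokens.append(word)
    return tokens


print(tokenize_fr("Aujourd'hui, j'ajoute un rappel"))
```

Splitting keeps the original surface forms, so nothing has to be rewritten into a different phrase; only the token boundaries change.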

penrods · 2018-03-15T17:07:38Z

I know this is really old, but I'm curious whether this fits into the "normalize()" approach I've implemented in English. Essentially, I do a pre-pass on the text that converts things like "it's" to "it is", which simplifies parsing.

Does it make sense to do a French normalize() preprocessor that converts things like "j'amie" to "je amie"? This would live in: https://github.com/MycroftAI/mycroft-core/blob/dev/mycroft/util/lang/parse_fr.py#L1027
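
For readers unfamiliar with the idea, a pre-pass of this kind can be sketched roughly as follows. This is an assumption-laden illustration for English, not the mycroft-core `normalize()` implementation; the `CONTRACTIONS` table and the function name `normalize_en` are hypothetical.

```python
# Hypothetical contraction table; not the one used by mycroft-core.
CONTRACTIONS = {
    "it's": "it is",
    "i'm": "i am",
    "don't": "do not",
}


def normalize_en(text):
    """Lowercase the text and expand known contractions word by word."""
    words = []
    for word in text.lower().split():
        # Unknown words pass through unchanged.
        words.append(CONTRACTIONS.get(word, word))
    return " ".join(words)


print(normalize_en("It's cold and I'm late"))
```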

gcrieloue-main · 2018-03-15T17:33:42Z

Hello,

While "it's" and "it is" are both valid in English, sadly "je aime" is not valid in French. (And by the way, "amie" is not a verb; it means "friend".)

penrods · 2018-03-16T07:25:18Z

C'est la vie! There is a reason I shouldn't be the one implementing the French parsers. :)

clusterfudge added the ready label Mar 22, 2016

clusterfudge removed the ready label Jun 23, 2021

clusterfudge added the Deferred post-1.0 label Sep 16, 2021
