Unicode support #16
I have read up a bit on Unicode and related subjects, and here is what I learned. I am writing it up to get us on the same page. If you are familiar with the concepts, just jump towards the end :)

To do n-grams properly and search in them, we need to apply normalization, probably NFKC together with case folding (nice description at https://www.elastic.co/guide/en/elasticsearch/guide/current/case-folding.html). In addition, a Unicode string would have to be tokenized properly, using a tokenizer from some Unicode library, to avoid chopping the string in the middle of a multibyte character.

With the words detected and NFKC and case folding applied, we could search the n-gram database for n-grams that match the entered prefix. However, since case folding can change the spelling of a word (such as ß->ss), we would have to keep the correct NFKC representation of that word as well. That way, we would have in the database:
Such an approach should make it possible to support any language, as far as I understand.

At present, Presage targets ISO 8859-1, which is a bit extended. For normalization it can use lowercase mode but, as I found earlier, it's far from perfect. In essence, any non-ASCII character is kept as it is. With Unicode multibyte characters that can probably mess things up big time, so maybe lowercasing should be disabled for anything non-Latin.

The library that can probably do all of this is ICU. However, we would have to rip the whole of Presage apart by replacing all strings with UnicodeString. All processing of these strings would then have to be replaced as well, by functions that do similar things. For input/output we can use std::string containing UTF-8, but internal processing is probably better done using ICU.
Should probably replace all internal string handling with a UTF-8 aware approach.
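
As a rough illustration of what UTF-8 aware handling means at the byte level, here is a minimal C++ sketch that splits a UTF-8 `std::string` on code point boundaries and takes a prefix without cutting a multibyte character in half. The helper names (`is_continuation`, `split_code_points`, `prefix_code_points`) are made up for this example and are not part of Presage or ICU. The code assumes well-formed UTF-8 and does no validation, normalization, or case folding.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
// Hypothetical helper, not a Presage or ICU function.
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Split a UTF-8 string into one std::string per code point.
// Assumption: the input is well-formed UTF-8; nothing is validated here.
std::vector<std::string> split_code_points(const std::string& utf8) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < utf8.size();) {
        std::size_t j = i + 1;
        // Extend the current code point over its continuation bytes.
        while (j < utf8.size() &&
               is_continuation(static_cast<unsigned char>(utf8[j]))) {
            ++j;
        }
        out.push_back(utf8.substr(i, j - i));
        i = j;
    }
    return out;
}

// Return the first `count` code points of `utf8`, never ending in the
// middle of a multibyte sequence (unlike a plain byte-based substr).
std::string prefix_code_points(const std::string& utf8, std::size_t count) {
    std::size_t i = 0;
    while (i < utf8.size() && count > 0) {
        ++i;  // consume the lead byte
        while (i < utf8.size() &&
               is_continuation(static_cast<unsigned char>(utf8[i]))) {
            ++i;  // consume its continuation bytes
        }
        --count;
    }
    return utf8.substr(0, i);
}
```

For example, `prefix_code_points("straße", 5)` keeps both bytes of `ß`, whereas `substr(0, 5)` on the same string would cut that character in half. Note that this only respects code point boundaries: characters built from several code points (combining marks and the like) can still be split, which is one more reason a library such as ICU is probably the better tool for the internal processing.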