
Unicode support #16

Open
rinigus opened this issue Mar 1, 2018 · 2 comments

Comments


rinigus commented Mar 1, 2018

We should probably replace all internal string handling with a UTF-8-aware approach.



rinigus commented Mar 4, 2018

I have read up a bit on Unicode and related subjects, and here is what I learned. I am writing it up to get us on the same page. If you are familiar with the concepts, just jump towards the end :)

To build n-grams and search in them properly, we need to apply normalization, probably NFKC together with case folding (nice description at https://www.elastic.co/guide/en/elasticsearch/guide/current/case-folding.html). In addition, a Unicode string would have to be tokenized with a tokenizer from some Unicode library, to avoid chopping the string in the middle of a multibyte character. With the words detected and NFKC plus case folding applied, we could search the n-gram database for n-grams that fit the prefix of the entered one. However, since case folding can lead to a misspelled word (such as ß->ss), we would have to keep the correct NFKC representation of that word as well. That way, we would have in the database:

Table 1 | NFKC_case_folded ngrams -> number of times occurred, index of the last word
Table 2 | index of the last word -> word in correct NFKC form

Such an approach should make it possible to support any language, as far as I can understand.
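To make the two-table idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not Presage code: the names `NgramStore`, `fold`, `learn` and `predict` are hypothetical, the `\w+` regex is a crude stand-in for a real Unicode tokenizer, and keying Table 1 by the pair (folded n-gram, word index) is my assumption for handling several spellings that fold to the same key.

```python
import re
import unicodedata
from collections import Counter


def nfkc(text: str) -> str:
    """Return the NFKC-normalized form of text."""
    return unicodedata.normalize("NFKC", text)


def fold(text: str) -> str:
    """NFKC + case folding; normalize again because folding can denormalize."""
    return nfkc(nfkc(text).casefold())


class NgramStore:
    """Hypothetical in-memory version of the two tables described above."""

    def __init__(self, n: int = 2) -> None:
        self.n = n
        # Table 1: (folded n-gram, index of last word) -> number of occurrences
        self.counts: Counter = Counter()
        # Table 2: index of last word -> word in its correct NFKC form
        self.words: list[str] = []
        self._ids: dict[str, int] = {}

    def _word_id(self, word: str) -> int:
        if word not in self._ids:
            self._ids[word] = len(self.words)
            self.words.append(word)
        return self._ids[word]

    def learn(self, text: str) -> None:
        # Assumption: \w+ on a str approximates word tokenization.
        tokens = re.findall(r"\w+", nfkc(text))
        for i in range(len(tokens) - self.n + 1):
            gram = tokens[i:i + self.n]
            key = tuple(fold(t) for t in gram)
            self.counts[(key, self._word_id(gram[-1]))] += 1

    def predict(self, context: list[str], prefix: str) -> list[str]:
        """Return correctly spelled candidates for the last word, most frequent first."""
        ctx = tuple(fold(w) for w in context)
        pre = fold(prefix)
        hits: Counter = Counter()
        for (key, wid), count in self.counts.items():
            if key[:-1] == ctx and key[-1].startswith(pre):
                hits[self.words[wid]] += count
        return [word for word, _ in hits.most_common()]


store = NgramStore(n=2)
store.learn("Die Straße ist lang. die Straße ist breit. DIE STRASSE")
print(store.predict(["DIE"], "stras"))  # ['Straße', 'STRASSE']
```

The lookup runs entirely on folded keys, so "DIE stras" matches n-grams learned as "die Straße", while the returned word comes from Table 2 and keeps its ß.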

At present, Presage targets ISO 8859-1 in a slightly extended form. For normalization it can use lowercase mode but, as I found earlier, it's far from perfect: in essence, any non-ASCII character is kept as it is. With Unicode multibyte characters that can probably mess things up big time, so maybe lowercasing should be disabled for anything non-Latin.
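The two failure modes can be shown with a small Python sketch. This does not reproduce Presage's actual lowercasing code; `ascii_lower` and `latin1_lower` are hypothetical helpers that model "lowercase ASCII only" and "lowercase bytes as if they were ISO 8859-1", both applied to UTF-8 input.

```python
def ascii_lower(data: bytes) -> bytes:
    """Lowercase ASCII letters only; every other byte is left untouched."""
    return data.lower()


def latin1_lower(data: bytes) -> bytes:
    """Lowercase each byte as if it were an ISO 8859-1 character."""
    return data.decode("latin-1").lower().encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "Ärger".encode("utf-8")  # b'\xc3\x84rger'

# ASCII-only lowercasing is safe but incomplete: the leading letter stays uppercase.
print(ascii_lower(word).decode("utf-8"))  # Ärger

# Latin-1 lowercasing rewrites the lead byte 0xC3 to 0xE3 and breaks the sequence.
print(is_valid_utf8(latin1_lower(word)))  # False

# Unicode-aware folding gives the expected key.
print("Ärger".casefold())  # ärger
```

In the first case "Ärger" and "ärger" end up as different n-gram keys; in the second the stored bytes are no longer valid UTF-8 at all.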

The library that can probably do all of this is ICU. However, we would have to rip the whole of Presage apart, replacing all strings with UnicodeString. All processing of these strings would then have to be replaced as well with functions that do the equivalent things. For input/output we can use std::string containing UTF-8, but internal processing is probably better done using ICU.
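The "UTF-8 at the boundary, Unicode inside" split can be sketched as follows. Python's `str` plays the role that UnicodeString would play in a C++ port; the helper names `prefix_bytes` and `prefix_chars` are hypothetical and only illustrate why byte-level slicing is unsafe.

```python
def prefix_bytes(utf8: bytes, n: int) -> bytes:
    """Naive prefix: cut after n bytes, possibly inside a multibyte character."""
    return utf8[:n]


def prefix_chars(utf8: bytes, n: int) -> bytes:
    """Decode at the boundary, cut after n code points, encode again on output."""
    return utf8.decode("utf-8")[:n].encode("utf-8")


word = "naïve".encode("utf-8")  # 'ï' takes two bytes: b'na\xc3\xafve'

print(prefix_bytes(word, 3))  # b'na\xc3' (a dangling lead byte, not valid UTF-8)
print(prefix_chars(word, 3).decode("utf-8"))  # naï
```

Cutting on code points is still not the same as cutting on user-perceived characters (a base letter followed by combining marks spans several code points), which is one more reason to leave segmentation to a library such as ICU and its BreakIterator.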

rinigus changed the title from "Handle lowercase for UTF8" to "Unicode support" on Mar 4, 2018