Unicode support #16

rinigus · 2018-03-01T08:14:48Z

We should probably replace all internal string handling with a UTF-8-aware approach.

rinigus · 2018-03-01T12:25:53Z

keywords / links:

https://unicode.org/faq/normalization.html
http://userguide.icu-project.org/boundaryanalysis

rinigus · 2018-03-04T20:13:05Z

I have read up a bit on Unicode and related subjects, and here is what I learned. I am writing it up to get us on the same page. If you are familiar with the concepts, just jump to the end :)

To handle n-grams and search in them properly, we need to apply normalization, probably NFKC together with case folding (nice description at https://www.elastic.co/guide/en/elasticsearch/guide/current/case-folding.html). In addition, a Unicode string would have to be tokenized with a tokenizer from a Unicode library, to avoid chopping the string in the middle of a multibyte char. With the words detected and NFKC and case folding applied, we could search the n-gram database for n-grams that match the prefix of the entered one. However, since case folding can produce a misspelled word (such as ß->ss), we would also have to keep the correct NFKC representation of that word. That way, we would have in the database:

Table 1 | NFKC_case_folded ngrams -> number of times occurred, index of the last word
Table 2 | index of the last word -> word in correct NFKC form

Such an approach should make it possible to support any language, as far as I can understand.
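As a rough illustration of the two tables, here is a minimal Python sketch using only the standard library. The class `NgramStore` and its methods are hypothetical names, not Presage code, and keeping a separate count per last-word index under each folded n-gram is an assumption about how Table 1 would handle several spellings that fold to the same key.

```python
import unicodedata
from collections import defaultdict


def nfkc(text):
    """Return the NFKC-normalized form of text."""
    return unicodedata.normalize("NFKC", text)


def fold(text):
    """NFKC plus case folding; normalize again because folding can denormalize."""
    return unicodedata.normalize("NFKC", nfkc(text).casefold())


class NgramStore:
    """Hypothetical in-memory stand-in for the two database tables."""

    def __init__(self):
        # Table 1: folded n-gram -> {index of last word -> occurrence count}
        self.counts = defaultdict(lambda: defaultdict(int))
        # Table 2: index of last word -> word in correct NFKC form
        self.words = []
        self._index = {}

    def _word_id(self, word):
        form = nfkc(word)
        if form not in self._index:
            self._index[form] = len(self.words)
            self.words.append(form)
        return self._index[form]

    def add(self, ngram):
        key = tuple(fold(w) for w in ngram)
        self.counts[key][self._word_id(ngram[-1])] += 1

    def complete(self, context, prefix):
        """Return (word, count) pairs whose folded form starts with the folded prefix."""
        ctx = tuple(fold(w) for w in context)
        pfx = fold(prefix)
        found = defaultdict(int)
        for key, by_word in self.counts.items():
            if len(key) == len(ctx) + 1 and key[:-1] == ctx and key[-1].startswith(pfx):
                for idx, n in by_word.items():
                    found[self.words[idx]] += n
        # Most frequent first, ties broken alphabetically
        return sorted(found.items(), key=lambda kv: (-kv[1], kv[0]))


store = NgramStore()
store.add(["die", "Straße"])
store.add(["Die", "Straße"])
store.add(["die", "Strasse"])
print(store.complete(["DIE"], "STR"))
```

Searching happens entirely on the folded keys, while the suggestions come back in the spelling stored in Table 2, so "Straße" is not returned as "strasse".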

At present, Presage targets a slightly extended ISO 8859-1. For normalization it can use lowercase mode but, as I found earlier, it's far from perfect: in essence, any non-ASCII char is kept as it is. With Unicode multibyte chars that can probably mess things up big time, so maybe lowercasing should be disabled for anything non-Latin.
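To make the lowercasing problem concrete, here is a small Python sketch. `ascii_lower` only models the behavior described above (non-ASCII chars kept as they are); it is not Presage's code. `latin1_lower_bytes` is a purely hypothetical byte-wise ISO 8859-1 lowercasing, included only to show what could go wrong if such a routine were ever applied to UTF-8 data.

```python
def ascii_lower(text):
    """Lowercase A-Z only; every other character passes through unchanged."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def latin1_lower_bytes(data):
    """Hypothetical: lowercase bytes as if each byte were one ISO 8859-1 char."""
    return data.decode("latin-1").lower().encode("latin-1")


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII-only lowercasing: the two spellings no longer compare equal.
print(ascii_lower("ÄRGER") == ascii_lower("ärger"))

# Unicode case folding: they do.
print("ÄRGER".casefold() == "ärger".casefold())

# Byte-wise Latin-1 lowercasing rewrites the lead byte of the UTF-8 sequence.
mangled = latin1_lower_bytes("ÄRGER".encode("utf-8"))
print(is_valid_utf8(mangled))
```

In the first case the words simply fail to match; in the last case the lead byte of "Ä" is changed and the result is no longer valid UTF-8.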

The library that can probably do all of this is ICU. However, we would have to rip the whole of Presage apart, replacing all strings with UnicodeString. All processing of these strings would then have to be replaced as well, with functions that do similar things. For input/output we can use std::string containing UTF-8, but internal processing is probably better done with ICU.
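The "UTF-8 at the edges, Unicode-aware processing inside" split can be sketched in Python, where `bytes` plays the role of the UTF-8 std::string and `str` the role of UnicodeString. The function name `tokenize_utf8` is hypothetical, and the `\w+` regular expression is only a simplified stand-in for ICU's word boundary analysis, not an equivalent of it.

```python
import re
import unicodedata


def tokenize_utf8(raw):
    """Decode UTF-8 input, split it into words, return UTF-8 encoded NFKC words."""
    # Boundary in: bytes -> text. Invalid UTF-8 raises UnicodeDecodeError here.
    text = raw.decode("utf-8")
    # Normalize first so that decomposed sequences (u + combining diaeresis)
    # become single letters before the simplified word matcher sees them.
    text = unicodedata.normalize("NFKC", text)
    # Simplified stand-in for a real Unicode word tokenizer.
    words = re.findall(r"\w+", text)
    # Boundary out: text -> bytes.
    return [w.encode("utf-8") for w in words]


raw = "Gru\u0308ße, мир!".encode("utf-8")
print([t.decode("utf-8") for t in tokenize_utf8(raw)])

# Cutting the raw bytes at an arbitrary offset can split a multibyte char.
try:
    "Grüße".encode("utf-8")[:3].decode("utf-8")
except UnicodeDecodeError as err:
    print("broken slice:", err.reason)
```

All slicing and matching happens on decoded text, so a multibyte char is never cut in half; byte offsets are only meaningful at the input and output boundaries.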

rinigus changed the title ~~Handle lowercase for UTF8~~ Unicode support Mar 4, 2018

tpikonen mentioned this issue Jan 15, 2024

Capitalization / non-LOWERCASE_MODE support #35

Open
