You are seeing the images like this because I copied them from an old issue record. I cloned the repository from GitHub, built version 1.11 locally with docker build, ran it, and when I imported the data it behaved the same way. I've updated the images with the current ones.
Hello everyone,
I previously opened an issue about this problem. It was said to have been fixed by pull request #305 (comment). I waited for a release containing the fix, but since none arrived, I cloned the repository from GitHub and verified it by running it with Docker. The problem has not been resolved.
There is a normalization issue in Charabia when processing Turkish characters. Turkish has several unique characters, such as "ç", "ğ", "ı", "İ", "ö", "ş", and "ü", which need to be normalized correctly for accurate text processing and search indexing. Currently, these characters are not normalized correctly, which leads to inaccurate search results and tokenization.
Steps to Reproduce:
Use Charabia to tokenize and normalize a text containing Turkish characters.
Compare the results with the expected normalized form of Turkish characters.
Example Text:
Original Text: "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
Expected Normalized Form: "calisma, gunluk or ğunluk, istanbul, istasyon, omur, sarki, utu"
Current Behavior:
The Turkish characters are not normalized to their correct forms, leading to inconsistencies in search results.
Expected Behavior:
Turkish characters should be normalized as follows:
"ç" -> "c"
"ğ" -> "g"
"ı" -> "i"
"ö" -> "o"
"ş" -> "s"
"ü" -> "u"
Note that Turkish casing pairs "I" with "ı" and "İ" with "i": "I" lowercases to "ı" and "İ" lowercases to "i", so after case folding both the dotted and dotless i should end up as "i".
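A minimal sketch of the folding described above, as a standalone Rust function (`normalize_turkish` is a hypothetical helper for illustration, not part of Charabia's API). It assumes the usual pipeline order: Turkish-aware lowercasing first, then folding the Turkish-specific letters to ASCII.

```rust
/// Hypothetical helper illustrating the desired normalization:
/// Turkish-aware lowercasing followed by folding to ASCII.
fn normalize_turkish(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            // Turkish casing rules: 'İ' lowers to 'i', 'I' lowers to 'ı'.
            let lower = match c {
                'İ' => 'i',
                'I' => 'ı',
                _ => c.to_lowercase().next().unwrap_or(c),
            };
            // Fold the Turkish-specific letters to their ASCII counterparts.
            match lower {
                'ç' => 'c',
                'ğ' => 'g',
                'ı' => 'i',
                'ö' => 'o',
                'ş' => 's',
                'ü' => 'u',
                other => other,
            }
        })
        .collect()
}

fn main() {
    let original = "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü";
    // Prints: calisma, gunluk, istanbul, istasyon, omur, sarki, utu
    println!("{}", normalize_turkish(original));
}
```

Handling "İ" and "I" explicitly matters because plain Unicode lowercasing maps "I" to "i" and "İ" to "i" followed by a combining dot, neither of which matches Turkish expectations.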
Impact:
This issue affects the accuracy of search results and the effectiveness of tokenization for Turkish text. It is crucial for Charabia to handle these characters correctly to support Turkish language text processing adequately.
Proposed Solution:
Implement a normalization rule for Turkish characters in Charabia.
Ensure that the normalization process correctly transforms Turkish characters to their expected forms.
To assist you better, I'm also sharing the dump of the data I'm using.
https://depo.niyazialpay.com/20240827-141437507.dump
Thank you for addressing this issue. Accurate normalization for Turkish characters will significantly improve the performance and reliability of Charabia for Turkish language text processing.