You are seeing the images like this because I copied them from an old issue record. I cloned the repository from GitHub, built version 1.11 locally with docker build, ran it, and when I imported the data it behaved the same way. I've updated the images with the current ones.
Hello everyone,
I previously opened an issue about this problem. It was said to have been fixed by pull request #305 (comment). I waited for a release containing the fix, but since none arrived, I cloned the repository from GitHub and verified it by running it with Docker. The problem has not been resolved.
There is a normalization issue in Charabia when processing Turkish characters. Turkish has several unique characters, such as "ç", "ğ", "ı", "İ", "ö", "ş", and "ü", which need to be normalized correctly for accurate text processing and search indexing. Currently, these characters are not normalized correctly, which leads to inaccurate search results and tokenization.
Steps to Reproduce:
Use Charabia to tokenize and normalize a text containing Turkish characters.
Compare the results with the expected normalized form of Turkish characters.
Example Text:
Original Text: "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
Expected Normalized Form: "calisma, gunluk or ğunluk, istanbul, istasyon, omur, sarki, utu"
Current Behavior:
The Turkish characters are not normalized to their correct forms, leading to inconsistencies in search results.
Expected Behavior:
Turkish characters should be normalized as follows:
"ç" -> "c"
"ğ" -> "g"
"ı" -> "i"
"ö" -> "o"
"ş" -> "s"
"ü" -> "u"
Note that Turkish casing pairs "I" with "ı" and "İ" with "i": "I" lowercases to "ı" and "İ" lowercases to "i", so after case folding both the dotted and dotless i should end up as "i".
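A minimal sketch of the folding described above, as a standalone Rust function (`normalize_turkish` is a hypothetical helper for illustration, not part of Charabia's API). It assumes the usual pipeline order: Turkish-aware lowercasing first, then folding the Turkish-specific letters to ASCII.

```rust
/// Hypothetical helper illustrating the desired normalization:
/// Turkish-aware lowercasing followed by folding to ASCII.
fn normalize_turkish(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            // Turkish casing rules: 'İ' lowers to 'i', 'I' lowers to 'ı'.
            let lower = match c {
                'İ' => 'i',
                'I' => 'ı',
                _ => c.to_lowercase().next().unwrap_or(c),
            };
            // Fold the Turkish-specific letters to their ASCII counterparts.
            match lower {
                'ç' => 'c',
                'ğ' => 'g',
                'ı' => 'i',
                'ö' => 'o',
                'ş' => 's',
                'ü' => 'u',
                other => other,
            }
        })
        .collect()
}

fn main() {
    let original = "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü";
    // Prints: calisma, gunluk, istanbul, istasyon, omur, sarki, utu
    println!("{}", normalize_turkish(original));
}
```

Handling "İ" and "I" explicitly matters because plain Unicode lowercasing maps "I" to "i" and "İ" to "i" followed by a combining dot, neither of which matches Turkish expectations.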
Impact:
This issue affects the accuracy of search results and the effectiveness of tokenization for Turkish text. It is crucial for Charabia to handle these characters correctly to support Turkish language text processing adequately.
Proposed Solution:
Implement a normalization rule for Turkish characters in Charabia.
Ensure that the normalization process correctly transforms Turkish characters to their expected forms.
To assist you better, I'm also sharing the dump of the data I'm using.
https://depo.niyazialpay.com/20240827-141437507.dump
Thank you for addressing this issue. Accurate normalization for Turkish characters will significantly improve the performance and reliability of Charabia for Turkish language text processing.