Persian language support for normalization and segmentation #304

Open
Ja7ad opened this issue Aug 12, 2024 · 6 comments
Labels
good first issue Good for newcomers

Comments

@Ja7ad

Ja7ad commented Aug 12, 2024

Hello

Thank you for your continuous efforts in maintaining and improving Charabia. I’m writing to request support for the Persian language in your normalization and segmentation modules, similar to the existing support for Arabic.

Background

Persian (Farsi) is a widely spoken language, using the same script as Arabic with some additional letters. Although Persian shares many similarities with Arabic, there are important differences in orthography, morphology, and syntax that require distinct handling for proper text processing, especially in tasks like tokenization, normalization, and segmentation.

  • ISO 639 language codes: per (ISO 639-2/B), fa (ISO 639-1)

Feature Request

I would like to request the addition of Persian language support for:

  1. Normalization:

    • Handling Persian-specific characters, such as "گ", "چ", "پ", "ژ".
    • Differentiating between Arabic and Persian diacritics and letters where applicable (e.g., "ی" vs. "ي", "ک" vs. "ك").
    • Normalizing Persian numerals (۰-۹) and ensuring compatibility with Arabic numerals where necessary.
  2. Segmentation:

    • Properly segmenting Persian text based on its unique grammatical structure.
    • Handling word boundaries and tokenization in the context of Persian, considering the language's syntax and morphology.
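
The normalization items above could be sketched roughly as follows. This is a minimal, std-only illustration, not Charabia's normalizer API: the function names `normalize_persian_char` and `normalize_persian` are hypothetical, and the choices to map Arabic yeh/kaf to their Persian forms and to fold digits to ASCII are assumptions that the maintainers may prefer to make differently.

```rust
/// Hypothetical helper (not part of Charabia): normalizes a single
/// character of Persian text.
fn normalize_persian_char(c: char) -> char {
    match c {
        // Arabic yeh (U+064A) -> Farsi yeh (U+06CC). Assumed direction.
        '\u{064A}' => '\u{06CC}',
        // Arabic kaf (U+0643) -> keheh (U+06A9). Assumed direction.
        '\u{0643}' => '\u{06A9}',
        // Extended Arabic-Indic (Persian) digits U+06F0..=U+06F9 -> ASCII.
        '\u{06F0}'..='\u{06F9}' => {
            char::from_digit(c as u32 - 0x06F0, 10).unwrap_or(c)
        }
        // Arabic-Indic digits U+0660..=U+0669 -> ASCII.
        '\u{0660}'..='\u{0669}' => {
            char::from_digit(c as u32 - 0x0660, 10).unwrap_or(c)
        }
        // Everything else, including peh, tcheh, jeh and gaf, is kept as-is.
        _ => c,
    }
}

/// Hypothetical helper: applies the per-character mapping to a whole string.
fn normalize_persian(text: &str) -> String {
    text.chars().map(normalize_persian_char).collect()
}
```

With this mapping, a query typed with Arabic "ك"/"ي" and a document written with Persian "ک"/"ی" end up with the same code points, and "۱۲۳" is folded to "123".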

Screenshot from 2024-08-12 11-57-15
Screenshot from 2024-08-12 11-57-30

References

To aid in this implementation, here are the links to the current normalization and segmentation implementations for Arabic, which can serve as a starting point for Persian:

Conclusion

Implementing Persian language support would greatly benefit users who need to process Persian text accurately. Persian is distinct enough from Arabic that this feature would significantly improve text processing capabilities for Persian-speaking users. I’m happy to contribute in any way I can to support this effort.

curquiza added the good first issue label on Aug 12, 2024
@Ja7ad
Author

Ja7ad commented Aug 18, 2024

@curquiza @Kerollmops I have run into an issue with the implementation: whatlang doesn't support a Persian script.

Persian uses many Unicode code points that Arabic does not. For example:

https://www.unicode.org/charts/PDF/U0600.pdf

(four attached images)

Because of this, I can't get the normalization tests to pass: whatlang has no Persian script to detect.

The whatlang repository is old and shows no activity toward adding a Persian script.

Ja7ad@f9b58e0

Ja7ad@029423a

I think it would be better for Meilisearch to fork whatlang and update the crate.
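
For illustration, here is a minimal std-only sketch of how text containing the Persian-specific letters mentioned earlier ("پ", "چ", "ژ", "گ") could be flagged without relying on a script detector. It does not use whatlang or Charabia; `PERSIAN_ONLY_LETTERS` and `contains_persian_specific_letter` are hypothetical names. It is only a heuristic: these letters live in the same Unicode "Arabic" block (U+0600 to U+06FF) and are also used by other languages such as Urdu and Kurdish, and Persian text that happens to contain none of them would not be flagged.

```rust
/// Letters used in Persian but not in the standard Arabic alphabet:
/// peh (U+067E), tcheh (U+0686), jeh (U+0698), gaf (U+06AF).
const PERSIAN_ONLY_LETTERS: [char; 4] =
    ['\u{067E}', '\u{0686}', '\u{0698}', '\u{06AF}'];

/// Hypothetical heuristic (not whatlang's or Charabia's API): returns true
/// when the text contains at least one of the letters listed above.
fn contains_persian_specific_letter(text: &str) -> bool {
    text.chars().any(|c| PERSIAN_ONLY_LETTERS.contains(&c))
}
```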

@ManyTheFish
Member

Hello @Ja7ad,
WhatLang doesn't support a Persian script, but which script is assigned to Persian instead? Arabic?
If I understand correctly, Arabic and Persian share a lot of characters; if that's the case, I'd like to consider them the same script for Charabia. This would avoid splitting words into parts.
If everything were considered Arabic, would it be relevant to apply this normalization to any Arabic language?

Thank you for the clarification!

@Ja7ad
Author

Ja7ad commented Aug 27, 2024

> Hello @Ja7ad, WhatLang doesn't support a Persian script, but which script is assigned to Persian instead? Arabic? If I understand correctly, Arabic and Persian share a lot of characters; if that's the case, I'd like to consider them the same script for Charabia. This would avoid splitting words into parts. If everything were considered Arabic, would it be relevant to apply this normalization to any Arabic language?
>
> Thank you for the clarification!

Some Persian characters are not supported in Arabic; please see the attached screenshots.

@ManyTheFish
Member

Yes, I understood that.
However, Charabia's technical approach is a simplification of the real linguistic state of languages.
For instance, the characters you listed before are considered Arabic by Charabia, even if that's not exactly true.
But considering Persian and Arabic as the same script is convenient if they share many characters.

For instance, Chinese and Japanese are completely different but share some characters, the Kanji. This forces Charabia to have a "virtual" script, Cj, containing both scripts, which avoids splitting a word in two because it contains different scripts.

The real question on my side is: should we normalize some Persian characters that are also used in Arabic and that shouldn't be normalized in an Arabic context?

If yes, is Persian a Language or a Script?
If not, normalizing your characters anyway should work.
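
The "same script avoids splitting a word" argument can be sketched as follows. This is a simplified std-only illustration, not Charabia's actual segmenter: the `Script` enum, `script_of`, and `split_by_script` are hypothetical, and classifying the whole Unicode Arabic block (U+0600 to U+06FF) as one script is the assumption being demonstrated.

```rust
/// Hypothetical, simplified script set (not Charabia's `Script` type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Arabic,
    Latin,
    Other,
}

/// Assumption: every character of the Unicode Arabic block, including the
/// Persian-specific letters, is classified as the same `Arabic` script.
fn script_of(c: char) -> Script {
    match c {
        '\u{0600}'..='\u{06FF}' => Script::Arabic,
        'a'..='z' | 'A'..='Z' => Script::Latin,
        _ => Script::Other,
    }
}

/// Splits the text into maximal runs of characters sharing one script.
fn split_by_script(text: &str) -> Vec<(Script, String)> {
    let mut runs: Vec<(Script, String)> = Vec::new();
    for c in text.chars() {
        let script = script_of(c);
        // Does the current character continue the last run?
        let same = runs.last().map_or(false, |r| r.0 == script);
        if same {
            if let Some((_, run)) = runs.last_mut() {
                run.push(c);
            }
        } else {
            runs.push((script, c.to_string()));
        }
    }
    runs
}
```

Under this classification, a word that mixes the Persian "ک" (U+06A9) with letters shared with Arabic stays in a single run. If Persian-specific letters were given their own script, the same word would be cut at each such letter.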

@Ja7ad
Author

Ja7ad commented Aug 28, 2024

> Yes, I understood that.
> However, Charabia's technical approach is a simplification of the real linguistic state of languages.
> For instance, the characters you listed before are considered Arabic by Charabia, even if that's not exactly true.
> But considering Persian and Arabic as the same script is convenient if they share many characters.
>
> For instance, Chinese and Japanese are completely different but share some characters, the Kanji. This forces Charabia to have a "virtual" script, Cj, containing both scripts, which avoids splitting a word in two because it contains different scripts.
>
> The real question on my side is: should we normalize some Persian characters that are also used in Arabic and that shouldn't be normalized in an Arabic context?
>
> If yes, is Persian a Language or a Script?
> If not, normalizing your characters anyway should work.

Yes, Persian is a language.
Even its segmentation is different.

@kamiyn

kamiyn commented Aug 28, 2024

I agree with this issue.

Arabic and Persian use many of the same letters, but they are quite different languages. They belong to different language families ( https://en.wikipedia.org/wiki/Language_family ),
and their grammar is completely different.

Persian grammar is closer to that of European languages than to Arabic grammar.

As a Japanese person, I feel that the difference between Persian and Arabic is similar to the one between Japanese and Chinese.
