An issue on Searching Japanese words. #1875

Open
su-jang opened this issue Oct 27, 2024 · 15 comments
Comments

@su-jang

su-jang commented Oct 27, 2024

An issue on Searching Japanese words.

When searching for Japanese words, no distinction is made between unvoiced consonants (清音) and voiced consonants (濁音).
For example, if you search for the word "Agaru (あがる)", both "Agaru (あがる)" and "Akaru (あかる)" are returned at the same time.

This makes the search results very cluttered.

[Screenshot, 2024-10-27 09-09-51]

@xiaoyifang
Owner

> if you search for a word "Agaru(あがる)", "Agaru(あがる)" and "Akaru(あかる)" will be searched at the same time.

Try adding Japanese morphology.

> And as a side topic, the GoldenDict-NG does not have setting options for "exact search", "forward search", and "backward search". The related issue was discussed earlier at github, #79.

#79 discusses in-page search, which relies on Chrome's search functionality. Chrome does not support those options.

@darlopvil

Are my messages currently being deleted? I swear this is the 3rd time I've posted a message here; last time I even attached a video.

This issue is getting old. Adding morphology doesn't fix the problem, and even when searching for words written in kanji you get the same results: tons of unrelated entries from every dictionary in the search group.
Video:

goldendict_wrong_entries.mp4

@shenlebantongying
Collaborator

shenlebantongying commented Nov 20, 2024

が has two representations:

As a standalone code point:
(ga) が U+304C HIRAGANA LETTER GA

Or as a combination of:
(ka) か U+304B HIRAGANA LETTER KA
(ten or hw_ka) U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK (or one of https://en.wiktionary.org/wiki/%E3%82%9B)

Then, per Table 8 of the Unicode normalization report, both NFKD and NFKC merge the various combinations of a hiragana (?) letter and a voiced sound mark.

https://www.unicode.org/reports/tr15/tr15-56.html
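
The two representations can be checked outside GoldenDict. The following is only a sketch using Python's standard `unicodedata` module, not the Qt code path GoldenDict actually uses; the `codepoints` helper is a made-up name for illustration. It prints how the precomposed form, the combining form, and a halfwidth voiced sound mark come out under NFKD and NFKC.

```python
import unicodedata


def codepoints(text: str) -> str:
    """Render a string as space-separated U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


precomposed = "\u304C"       # ga as a single code point
combining = "\u304B\u3099"   # ka + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
halfwidth = "\u304B\uFF9E"   # ka + HALFWIDTH KATAKANA VOICED SOUND MARK

for label, text in [
    ("precomposed", precomposed),
    ("combining", combining),
    ("halfwidth", halfwidth),
]:
    nfkd = unicodedata.normalize("NFKD", text)
    nfkc = unicodedata.normalize("NFKC", text)
    # Every input ends up decomposed under NFKD and composed under NFKC.
    print(f"{label:12} NFKD={codepoints(nfkd)}  NFKC={codepoints(nfkc)}")
```

All three inputs should end up as ka + U+3099 under NFKD and as the single code point U+304C under NFKC.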


Other references

So the root cause appears to be the normalization step, which is considered a feature for some other languages.

@darlopvil

in other words, "we're not fixing this".

@xiaoyifang
Owner

Can you provide the dictionary for testing?

> in other words, "we're not fixing this".

Maybe this issue will be solved in the future.

@darlopvil

> Can you provide the dictionary for testing ?

JMDict Furigana, JMDict+: https://jd4gd.com/jmdictplus.html
Jitendex: https://jitendex.org/pages/downloads.html
Edict+: https://jd4gd.com/edictplus.html

That's 4.

@xiaoyifang
Owner

xiaoyifang commented Nov 21, 2024


Not reproduced. Search word: 異なる

Maybe try deleting the dictionary's index and trying again,
or try disabling some of your other dictionaries. This looks like some kind of synonym search?


@shenlebantongying
Collaborator

I think the issue is both and are showing in the results ?


@xiaoyifang
Owner

xiaoyifang commented Nov 21, 2024

From the video, JMdict shows both 違う and 異なる (both mean "different").
My wild guess is that some synonym strategy has been mixed in.

@darlopvil

> Maybe ,try to delete the dictionary's index and try again. or try to disable some of your other dictionaries , seems like some synonymy word search?

I did, and nothing changed. Also, I've deleted all these dicts from the directory and the same issue persists with the remaining dicts.
Jitendex, for example (made by a different author), behaves the same.

@shenlebantongying
Collaborator

shenlebantongying commented Nov 24, 2024

The fix is to change the Unicode normalization form from NormalizationForm_KD to NormalizationForm_KC:

.normalized( QString::NormalizationForm_KD )

After the change:


The technical reason is shown in this table: https://unicode.org/reports/tr15/#NFKD_And_NFKC_Applied_Table

In NormalizationForm_KD (NFKD), the ga が is decomposed into ka か + ten ◌゙, so when searching for ga, both ka and ka+ten show up.
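
A minimal sketch of this effect in Python, assuming the search key is built by normalizing the word and then dropping combining marks. That second step is an assumption about how the index folds diacritics and is not confirmed in this thread. `fold_nfkd` and `fold_nfkc` are hypothetical names, not GoldenDict functions.

```python
import unicodedata


def strip_marks(text: str) -> str:
    # Drop every character with a non-zero canonical combining class.
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def fold_nfkd(word: str) -> str:
    # Assumed current behaviour: decompose first, then strip marks.
    return strip_marks(unicodedata.normalize("NFKD", word))


def fold_nfkc(word: str) -> str:
    # Proposed behaviour: compose first, so the precomposed ga survives stripping.
    return strip_marks(unicodedata.normalize("NFKC", word))


agaru = "\u3042\u304C\u308B"  # あがる
akaru = "\u3042\u304B\u308B"  # あかる

print(fold_nfkd(agaru) == fold_nfkd(akaru))  # True: both collapse to the same key
print(fold_nfkc(agaru) == fold_nfkc(akaru))  # False: the keys stay distinct
```

Under these assumptions the NFKD key for あがる is identical to the key for あかる, which would explain why both headwords come back for one query.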


Not sure what to do, because changing that value will probably break other languages that treat characters with and without these marks as the same character. (I'm also not sure this is true. Is NFKC more suitable here?)
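
The concern can be illustrated with a Latin-script word. This is a sketch under the same assumption as before: the search key is the normalized word with combining marks removed, which this thread does not confirm. `fold` is a hypothetical helper.

```python
import unicodedata


def fold(word: str, form: str) -> str:
    # Hypothetical folding: normalize, then drop combining marks.
    normalized = unicodedata.normalize(form, word)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))


cafe_accented = "caf\u00E9"  # "café" with a precomposed e-acute

print(fold(cafe_accented, "NFKD"))  # cafe: the accent is decomposed and stripped
print(fold(cafe_accented, "NFKC"))  # café: the accent stays inside the composed letter
```

With NFKD a query for "cafe" would share a key with "café"; with NFKC it would not, unless accents are folded some other way. Whether that counts as a regression depends on what users of those languages expect.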

It may also require all dicts to be reindexed.
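
Why reindexing might be needed can be sketched with a toy in-memory index. The assumptions are that stored keys were produced with the old folding and that queries would be folded with the new one. `fold` and `build_index` are hypothetical helpers and do not reflect GoldenDict's actual index format.

```python
import unicodedata


def fold(word: str, form: str) -> str:
    # Hypothetical folding: normalize, then drop combining marks.
    normalized = unicodedata.normalize(form, word)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))


def build_index(headwords: list[str], form: str) -> dict[str, list[str]]:
    # Map each folded key to the headwords that produced it.
    index: dict[str, list[str]] = {}
    for headword in headwords:
        index.setdefault(fold(headword, form), []).append(headword)
    return index


headwords = ["\u3042\u304C\u308B", "\u3042\u304B\u308B"]  # あがる, あかる

old_index = build_index(headwords, "NFKD")  # keys written by the old folding
query_key = fold(headwords[0], "NFKC")      # query folded the new way

print(query_key in old_index)               # False: the old index has no such key
new_index = build_index(headwords, "NFKC")  # rebuilt with the new folding
print(new_index[query_key])                 # only あがる
```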

@xiaoyifang
Owner

xiaoyifang commented Nov 25, 2024

> In NormalizationForm_KD or NFKD, the ga が will be decomposed into ka か + ten ◌゙, so when searching ga, both ka and ka+ten shows up.

This does not explain why 違う also shows up when searching for 異なる, nor why I cannot reproduce it.

@xiaoyifang
Owner

xiaoyifang commented Nov 25, 2024

> Also, i've deleted all these dicts from the directory and still the same issue with the rest of dicts.

BTW, you do not have to delete the other dictionaries. Just disable them, or enable only the dictionary under test,
or create a group for the dictionary under test to rule out the other dictionaries' impact.

@darlopvil

Well, the issue is that after removing those dicts, the problem still wasn't fixed.
In fact, the issue persists when searching for 異なる. So I just left the dicts as they are. It's annoying af, but at least I can use them to look up words.

This is how it looks now: (only showing three dicts from a bigger list)

[image]

@xiaoyifang
Owner

xiaoyifang commented Dec 9, 2024

Uncheck this option: Edit -> Preferences -> Advanced -> Extra search via synonyms
