Use jieba to cut Chinese directly instead charabia #659

Open
ColinWttt opened this issue Jul 11, 2024 · 3 comments

ColinWttt commented Jul 11, 2024

Currently charabia segments Chinese and Japanese incorrectly (#591), and 1.1.1-alpha.1 does not solve the problem.

My native language is Chinese, and I am developing a web application. For several months I have been using jieba-wasm in the browser to segment the search term inside processTerm, and this has worked very well.

  import init, { cut } from "@/wasm/web/jieba_rs_wasm";
  ...
  await init();
  ...
  processTerm: function (input: string) {
    // Segment the search term with jieba
    return cut(input, false).join("  ");
  },
  ...

So perhaps using jieba directly (not jieba-wasm) for Chinese segmentation, instead of charabia, could yield better results?

ColinWttt changed the title from "use jieba to cut Chinese directly instead charabia" to "Use jieba to cut Chinese directly instead charabia" on Jul 11, 2024
Contributor

bglw commented Jul 12, 2024

Hmm, under the hood charabia defers to jieba-rs for segmentation in Chinese, so I wouldn't expect any behavior to change with using jieba directly.

Is there something I'm missing?

Author

ColinWttt commented Jul 12, 2024

I don't quite understand why charabia, which is built on jieba-rs, doesn't work, since both jieba-rs and jieba-wasm segment this correctly.

The Pagefind documentation describes the current limitation:

When searching in the browser, searching for 每個, 月, or 都 individually will work. Additionally, searching 每個 月 都 will return results containing each word in any order, and searching "每個 月 都" in quotes will match 每個月都 exactly.

Searching for 每個月都 will return zero results, as Pagefind is not able to segment it into words in the browser. Work to improve this is underway and will hopefully remove this limitation in the future.

As that passage says, searching for "每个月都" does not return the correct results by default. With jieba in processTerm, however, the term is segmented before the search runs, which is equivalent to the user typing "每个", "月", "都" instead of "每个月都". This way, the correct search results are returned.

("個" is Traditional Chinese and "个" is Simplified Chinese; the result should be the same for both.)

  processTerm: function (input: string) {
    // Use jieba-wasm to segment the search term, and log the result
    console.log(cut(input, false).join("  "));
    return cut(input, false).join("  ");
  },

Input: "每个月都"

[Screenshot, 2024-07-12 15:08:39]

Console log: "每个 / 月 / 都"
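
To make the idea above easier to try without loading any wasm, here is a minimal sketch of the same processTerm pattern with the segmenter passed in as a function. The names `makeProcessTerm`, `Segmenter`, and `fakeCut` are hypothetical and exist only for this illustration. `fakeCut` is a stand-in, not jieba. Joining with a single space and dropping whitespace-only tokens are my assumptions; the snippet above joins with two spaces.

```typescript
// Hypothetical type: any function that splits a search term into tokens.
type Segmenter = (input: string) => string[];

// Hypothetical helper: wraps a segmenter in a processTerm-style callback.
function makeProcessTerm(segment: Segmenter): (input: string) => string {
  return (input: string): string =>
    segment(input)
      .map((token) => token.trim())
      // Drop whitespace-only tokens so the query has no runs of spaces.
      .filter((token) => token.length > 0)
      .join(" ");
}

// Stand-in segmenter for illustration only. A real page would instead pass
// (input) => cut(input, false) from jieba-wasm, after `await init()`.
const fakeCut: Segmenter = (input) =>
  input === "每个月都" ? ["每个", "月", "都"] : [input];

const processTerm = makeProcessTerm(fakeCut);
console.log(processTerm("每个月都")); // prints "每个 月 都"
```

Keeping the segmenter as a parameter means the joining logic can be checked on its own, and the segmenter can be swapped later without touching the search configuration.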

Contributor

bglw commented Jul 12, 2024

Oh I see, yes Pagefind doesn't use anything for segmentation when searching — charabia is only used when indexing the site using the Pagefind binary, it's not a web dependency.

I have a couple of ideas for making the frontend search better (to solve the limitations mentioned in the docs). Unfortunately using jieba in the frontend is a no-go. Pagefind's wasm is currently ~70 KB, and adding jieba would make it closer to 3 MB, which blows out Pagefind's primary goal of being low-bandwidth.

I do think there will be a way to solve this using the dictionary data that exists in the index, but I haven't had time to scope out that work yet.
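
For readers wondering what a dictionary-based approach could look like in principle, below is a rough sketch of greedy longest-match segmentation against a known word list. This is not Pagefind's implementation or a committed design. The function `segmentWithDictionary` and the `indexedWords` set are hypothetical, and how word data would actually be read from a Pagefind index is not covered here. Greedy matching is also a much cruder technique than what jieba does.

```typescript
// Hypothetical sketch: split `input` by repeatedly taking the longest prefix
// that appears in `words`, falling back to a single character.
function segmentWithDictionary(input: string, words: Set<string>): string[] {
  // Work on code points so characters outside the BMP are not split.
  const chars = Array.from(input);

  // Longest dictionary entry, measured in code points.
  let maxLen = 1;
  words.forEach((word) => {
    maxLen = Math.max(maxLen, Array.from(word).length);
  });

  const tokens: string[] = [];
  let i = 0;
  while (i < chars.length) {
    // Try the longest candidate first; a length of 1 is always accepted.
    for (let len = Math.min(maxLen, chars.length - i); len >= 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (len === 1 || words.has(candidate)) {
        tokens.push(candidate);
        i += len;
        break;
      }
    }
  }
  return tokens;
}

// Hypothetical word list standing in for words known from the index.
const indexedWords = new Set(["每个", "月", "都"]);
console.log(segmentWithDictionary("每个月都", indexedWords).join(" ")); // prints "每个 月 都"
```

A sketch like this needs no bundled dictionary, since it only uses words that are already known, but it can mis-segment ambiguous text where the longest match is not the intended word.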
