Use jieba to cut Chinese directly instead charabia #659

ColinWttt · 2024-07-11T14:24:24Z

Currently charabia segments Chinese and Japanese incorrectly (#591), and 1.1.1-alpha.1 does not solve the problem.

My native language is Chinese, and I am developing a web application. For several months I have used jieba-wasm in the browser to segment the search term in processTerm, and this has worked very well.

    import init, { cut } from "@/wasm/web/jieba_rs_wasm";
    ...
    await init();
    ...
    processTerm: function (input: string) {
      // use jieba
      return cut(input, false).join("  ");
    },
    ...

So perhaps using jieba directly (not jieba-wasm) for Chinese segmentation, instead of charabia, could yield better results?

bglw · 2024-07-12T02:22:45Z

Hmm, under the hood charabia defers to jieba-rs for segmentation in Chinese, so I wouldn't expect any behavior to change with using jieba directly.

Is there something I'm missing?

ColinWttt · 2024-07-12T07:09:18Z

I don't quite understand why charabia, which is built on jieba-rs, doesn't work, since both jieba-rs and jieba-wasm should be fine.

When searching in the browser, searching for 每個, 月, or 都 individually will work. Additionally, searching 每個 月 都 will return results containing each word in any order, and searching "每個 月 都" in quotes will match 每個月都 exactly.

Searching for 每個月都 will return zero results, as Pagefind is not able to segment it into words in the browser. Work to improve this is underway and will hopefully remove this limitation in the future.

As the Pagefind documentation states, searching for "每个月都" does not return correct results by default. However, once jieba is used in processTerm, the query is equivalent to the user typing "每个", "月", "都" instead of "每个月都", so the correct search results are returned.

("個" is Traditional Chinese and "个" is Simplified Chinese; the result should be the same.)

    processTerm: function (input: string) {
      // use jieba-wasm to cut the input once, then log and return the result
      const segmented = cut(input, false).join("  ");
      console.log(segmented);
      return segmented;
    },

input "每个月都"

console log "每个 / 月 / 都"
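The processTerm approach described above can be sketched in a form that runs without the wasm module. This is only an illustration: `makeProcessTerm`, `Segmenter`, and `fakeCut` are hypothetical names that do not appear in the thread, and the stand-in segmenter would be replaced by `(text) => cut(text, false)` from jieba-wasm in a real page. The sketch joins tokens with a single space, whereas the snippets above use two.

```typescript
// A segmenter takes raw text and returns its tokens.
// In the snippets above, jieba-wasm's cut(input, false) is used this way.
type Segmenter = (text: string) => string[];

// Hypothetical helper: builds a processTerm-style function from any segmenter.
function makeProcessTerm(segment: Segmenter): (input: string) => string {
  return (input: string): string =>
    segment(input)
      .map((token) => token.trim())
      .filter((token) => token.length > 0) // drop whitespace-only tokens
      .join(" "); // space-separated words, as if the user had typed them apart
}

// Stand-in segmenter for demonstration only (an assumption, not jieba).
const fakeCut: Segmenter = (text) =>
  text === "每个月都" ? ["每个", "月", "都"] : [text];

const processTerm = makeProcessTerm(fakeCut);
console.log(processTerm("每个月都")); // "每个 月 都"
```

Passing the segmenter in as a parameter keeps the joining logic testable without having to load the wasm bundle first.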

bglw · 2024-07-12T23:56:43Z

Oh I see, yes Pagefind doesn't use anything for segmentation when searching — charabia is only used when indexing the site using the Pagefind binary, it's not a web dependency.

I have a couple ideas for making the frontend search better (to solve the limitations mentioned in the docs). Unfortunately using jieba in the frontend is a no-go. Pagefind's wasm is currently ~70kb, and adding jieba would make it closer to 3mb, which blows out Pagefind's primary goal of being low-bandwidth.

I do think there will be a way to solve this using the dictionary data that exists in the index, but I haven't had time to scope out that work yet.
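To make the idea of segmenting with dictionary data more concrete, here is a minimal sketch of forward maximum matching, a common dictionary-based segmentation technique. It is not Pagefind's design, which had not been scoped at the time of this comment. The names `segmentByDictionary`, `WordLookup`, and `maxWordLength` are hypothetical, and because the way an index would expose its words is unknown, the lookup is passed in as a function.

```typescript
// Lookup callback: returns true when `candidate` is a known indexed word.
// How a real index would answer this is an assumption, so it is injected.
type WordLookup = (candidate: string) => boolean;

// Forward maximum matching: at each position take the longest known word,
// falling back to a single character when nothing matches.
// Note: this works on UTF-16 code units and does not handle surrogate pairs.
function segmentByDictionary(
  query: string,
  hasWord: WordLookup,
  maxWordLength: number = 4 // assumed upper bound on word length
): string[] {
  const tokens: string[] = [];
  let start = 0;
  while (start < query.length) {
    let end = Math.min(query.length, start + maxWordLength);
    // Shrink the window until it is a known word or one character remains.
    while (end > start + 1 && !hasWord(query.substring(start, end))) {
      end--;
    }
    tokens.push(query.substring(start, end));
    start = end;
  }
  return tokens;
}

// Toy vocabulary standing in for the words stored in an index.
const indexedWords = ["每个", "月", "都"];
const hasWord: WordLookup = (candidate) => indexedWords.indexOf(candidate) !== -1;

console.log(segmentByDictionary("每个月都", hasWord).join(" ")); // "每个 月 都"
```

A greedy matcher like this needs no extra download beyond the words already in the index, but it can split ambiguous phrases differently from a statistical segmenter such as jieba.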

ColinWttt changed the title ~~use jieba to cut Chinese directly instead charabia~~ Use jieba to cut Chinese directly instead charabia Jul 11, 2024
