Use jieba to cut Chinese directly instead of charabia #659
Comments
Hmm, under the hood charabia defers to jieba-rs for segmentation in Chinese, so I wouldn't expect any behavior to change with using jieba directly. Is there something I'm missing? |
I don't quite understand why Charabia built on jieba-rs doesn't work, as both jieba-rs and jieba-wasm should be fine.
As stated in the Pagefind documentation, searching for "每个月都" does not yield correct results by default. ("個" is Traditional Chinese and "个" is Simplified Chinese; the result should be the same.) However, it works after using jieba in processTerm:
processTerm: function (input: string) {
  // use jieba-wasm to cut the query into words
  const segmented = cut(input, false).join(" ");
  console.log(segmented);
  return segmented;
},
The console logs "每个 / 月 / 都". |
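For anyone who wants to try the processTerm approach above without pulling in the wasm package first, here is a minimal sketch. The `Segmenter` type is only modeled on the `cut(input, false)` call quoted in this comment, and `makeProcessTerm` and `stubCut` are hypothetical names made up for illustration; they are not part of Pagefind or jieba-wasm. The stub uses a tiny hard-coded dictionary so the snippet runs on its own.

```typescript
// Assumed shape of a segmenter, modeled on the `cut(input, false)` call quoted above.
type Segmenter = (text: string, hmm: boolean) => string[];

// Hypothetical helper: wraps any segmenter into a processTerm-style callback
// that returns the query as space-separated words.
function makeProcessTerm(cut: Segmenter): (input: string) => string {
  return (input: string): string => {
    const words = cut(input, false)
      .map((w) => w.trim())
      .filter((w) => w.length > 0); // drop whitespace-only tokens
    return words.join(" ");
  };
}

// Stand-in segmenter for demonstration only (NOT jieba): greedy lookup
// against a tiny dictionary, falling back to single characters.
const stubCut: Segmenter = (text) => {
  const dictionary = ["每个", "月", "都"];
  const out: string[] = [];
  let i = 0;
  while (i < text.length) {
    let piece = text.charAt(i);
    for (const word of dictionary) {
      if (text.slice(i, i + word.length) === word) {
        piece = word;
        break;
      }
    }
    out.push(piece);
    i += piece.length;
  }
  return out;
};

const processTerm = makeProcessTerm(stubCut);
console.log(processTerm("每个月都")); // prints "每个 月 都"
```

In a real page the stub would presumably be replaced by the `cut` export from jieba-wasm, passed in the same way, and the resulting function handed to the processTerm option.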
Oh I see, yes. Pagefind doesn't use anything for segmentation when searching: charabia is only used when indexing the site with the Pagefind binary, and it's not a web dependency. I have a couple of ideas for making the frontend search better (to solve the limitations mentioned in the docs). Unfortunately, using jieba in the frontend is a no-go. Pagefind's wasm is currently ~70 kB, and adding jieba would bring it closer to 3 MB, which blows out Pagefind's primary goal of being low-bandwidth. I do think there will be a way to solve this using the dictionary data that exists in the index, but I haven't had time to scope out that work yet. |
Currently, charabia segments Chinese and Japanese incorrectly (#591), and 1.1.1-alpha.1 does not solve the problem.
My native language is Chinese, and I am developing a web application. For several months I have been using jieba-wasm on the web to cut the search term in processTerm, and this has worked very well.
So perhaps using jieba (not jieba-wasm) directly for Chinese segmentation, instead of charabia, could yield better results?