feat: support jieba tokenizer, a popular Chinese tokenizer #3205
Conversation
Compare 5cf83ea to 273d590
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main    #3205      +/-   ##
==========================================
+ Coverage   78.62%   78.66%   +0.04%
==========================================
  Files         243      243
  Lines       82889    82892       +3
  Branches    82889    82892       +3
==========================================
+ Hits        65170    65206      +36
+ Misses      14933    14902      -31
+ Partials     2786     2784       -2
Seems reasonable. We can't really make this an optional dependency in Python (since we ship binaries), but it would be nice if we could in Rust.
Also, could you report the effect on binary size for the Python wheels?
@@ -50,6 +50,7 @@ serde_json.workspace = true
serde.workspace = true
snafu.workspace = true
tantivy.workspace = true
tantivy-jieba.workspace = true
Could we make this an optional dependency for lance-index?
@wjones127 Is there an example of an optional dependency in another Rust library? I'm not quite familiar with how to achieve this.
https://doc.rust-lang.org/cargo/reference/features.html#optional-dependencies

Suggested change:
- tantivy-jieba.workspace = true
+ tantivy-jieba = { workspace = true, optional = true }
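The Cargo documentation linked above pairs an optional dependency with a feature that enables it. A minimal sketch of what the relevant sections of the lance-index Cargo.toml could look like (the feature name `tantivy-jieba` matches the later suggestions in this thread; the exact layout is an assumption, not the crate's actual manifest):

```toml
[dependencies]
# Marked optional: the crate is not compiled unless a feature enables it.
tantivy-jieba = { workspace = true, optional = true }

[features]
# Enabling this feature pulls in the optional crate of the same name.
tantivy-jieba = ["dep:tantivy-jieba"]
```

Downstream crates then opt in with `features = ["tantivy-jieba"]` in their own dependency declaration.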
Added some suggestions demonstrating how to make tantivy-jieba an optional dependency.
"jieba" => Ok(
    tantivy::tokenizer::TextAnalyzer::builder(tantivy_jieba::JiebaTokenizer {}).dynamic(),
),

Suggested change:
#[cfg(feature = "tantivy-jieba")]
"jieba" => Ok(
    tantivy::tokenizer::TextAnalyzer::builder(tantivy_jieba::JiebaTokenizer {}).dynamic(),
),
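Feature-gating a single match arm like this compiles the arm out entirely when the feature is disabled, so the name falls through to the error case instead of referencing a crate that was never built. A simplified, self-contained sketch of the pattern (this `tokenizer_kind` lookup is hypothetical, not Lance's actual function):

```rust
// Simplified sketch of a feature-gated match arm. When the
// "tantivy-jieba" feature is not enabled, the "jieba" arm does not
// exist at all, so the name falls through to the error case.
fn tokenizer_kind(name: &str) -> Result<&'static str, String> {
    match name {
        "simple" => Ok("simple"),
        "whitespace" => Ok("whitespace"),
        #[cfg(feature = "tantivy-jieba")]
        "jieba" => Ok("jieba"),
        other => Err(format!("unknown tokenizer: {}", other)),
    }
}
```

Compiled without the feature, `tokenizer_kind("jieba")` returns the "unknown tokenizer" error, which is exactly the behavior the suggestion above produces for the real builder.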
@@ -12,6 +12,7 @@ pub struct TokenizerConfig {
/// - `simple`: splits tokens on whitespace and punctuation
/// - `whitespace`: splits tokens on whitespace
/// - `raw`: no tokenization
/// - `jieba`: a popular Chinese tokenizer
Suggested change:
- /// - `jieba`: a popular Chinese tokenizer
+ /// - `jieba`: a popular Chinese tokenizer (enabled by the `tantivy-jieba` feature)
To python/Cargo.toml, add:

lance-index = { path = "../rust/lance-index", features = ["tantivy-jieba"] }
Thanks for your suggestion. I tested the Python wheel size in release mode:
@wjones127 @SaintBacchus The Chinese tokenizer contains an n-gram language model, but we have to customize the language model for different scenarios. Can we have a mechanism that loads the language model dynamically? For example, get the language model path from an environment variable; then we can exclude the language model from the wheel. I created a PR to illustrate my idea: #3218
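The idea of resolving the model location at runtime could be sketched as a small helper that prefers an environment variable over a bundled default. This is a hypothetical illustration, not the code from #3218, and the variable name `LANCE_LANGUAGE_MODEL_HOME` and default path are assumptions:

```rust
use std::env;
use std::path::PathBuf;

// Pure helper: pick the model directory from an optional env value,
// falling back to a default. Keeping it pure makes it easy to test.
fn resolve_model_dir(env_value: Option<String>) -> PathBuf {
    env_value
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("models"))
}

// Caller reads the (hypothetical) variable once at tokenizer setup:
fn language_model_dir() -> PathBuf {
    resolve_model_dir(env::var("LANCE_LANGUAGE_MODEL_HOME").ok())
}
```

With this shape, the wheel ships no model data; users point the variable at a model directory installed separately.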
I'm not loving the additional code size added by each tokenizer. I do agree a plugin system would be better. Here's my thinking:
Since #3218 has merged, the jieba tokenizer is already provided. This PR should be closed.
Add the jieba tokenizer to Lance for Chinese sentences.