
feat: support jieba tokenizer which is a popular chinese tokenizer #3205

Closed

Conversation

SaintBacchus (Collaborator):

Add the jieba tokenizer to Lance for Chinese sentences.

github-actions bot added the enhancement (New feature or request) and python labels on Dec 5, 2024
@codecov-commenter

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Project coverage is 78.66%. Comparing base (6e84834) to head (273d590).

Files with missing lines                            Patch %   Lines
rust/lance-index/src/scalar/inverted/tokenizer.rs   0.00%     3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3205      +/-   ##
==========================================
+ Coverage   78.62%   78.66%   +0.04%     
==========================================
  Files         243      243              
  Lines       82889    82892       +3     
==========================================
+ Hits        65170    65206      +36     
+ Misses      14933    14902      -31     
+ Partials     2786     2784       -2     
Flag        Coverage Δ
unittests   78.66% <0.00%> (+0.04%) ⬆️


wjones127 (Contributor) left a comment:

Seems reasonable. We can't really make this an optional dependency in Python (since we ship binaries), but it would be nice if we could in Rust.

Also, could you report the effect on binary size for the python wheels?

@@ -50,6 +50,7 @@ serde_json.workspace = true
serde.workspace = true
snafu.workspace = true
tantivy.workspace = true
tantivy-jieba.workspace = true
Contributor:
Could we make this an optional dependency for lance-index?

SaintBacchus (Collaborator, Author):
@wjones127 Is there an example of an optional dependency in another Rust lib? I'm not quite familiar with how to achieve it.

Contributor:
https://doc.rust-lang.org/cargo/reference/features.html#optional-dependencies

Suggested change:
- tantivy-jieba.workspace = true
+ tantivy-jieba = { workspace = true, optional = true }
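
For context: marking a dependency optional automatically creates an implicit Cargo feature with the same name, and a crate can also declare the feature explicitly. A minimal sketch of the explicit form for rust/lance-index/Cargo.toml (illustrative only, not necessarily what was merged):

# Illustrative sketch: an explicit [features] entry so that building
# with --features tantivy-jieba pulls in the optional dependency.
# The optional dependency alone already provides an implicit feature
# of the same name, so this entry is a stylistic choice.
[features]
tantivy-jieba = ["dep:tantivy-jieba"]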

wjones127 (Contributor) left a comment:

Added some suggestions demonstrating how to make tantivy-jieba an optional dependency.

Comment on lines +145 to +147
"jieba" => Ok(
tantivy::tokenizer::TextAnalyzer::builder(tantivy_jieba::JiebaTokenizer {}).dynamic(),
),
Suggested change:
+ #[cfg(feature = "tantivy-jieba")]
  "jieba" => Ok(
      tantivy::tokenizer::TextAnalyzer::builder(tantivy_jieba::JiebaTokenizer {}).dynamic(),
  ),
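
With the arm feature-gated, a build without the feature would let "jieba" fall through to the generic unknown-tokenizer error. A self-contained sketch (illustrative names and types, not the actual lance-index code) of how a lookup can report the missing feature explicitly:

// Illustrative sketch only; lance-index's real return and error types
// differ. cfg attributes on match arms compile in exactly one of the
// two "jieba" arms, depending on whether the feature is enabled.
fn build_tokenizer(name: &str) -> Result<String, String> {
    match name {
        #[cfg(feature = "tantivy-jieba")]
        "jieba" => Ok("jieba analyzer".to_string()),
        #[cfg(not(feature = "tantivy-jieba"))]
        "jieba" => Err(
            "the 'jieba' tokenizer requires building with the 'tantivy-jieba' feature"
                .to_string(),
        ),
        other => Err(format!("unknown tokenizer: {other}")),
    }
}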

@@ -12,6 +12,7 @@ pub struct TokenizerConfig {
/// - `simple`: splits tokens on whitespace and punctuation
/// - `whitespace`: splits tokens on whitespace
/// - `raw`: no tokenization
/// - `jieba`: a popular chinese tokenization
Suggested change:
- /// - `jieba`: a popular chinese tokenization
+ /// - `jieba`: a popular chinese tokenization (enabled by `tantivy-jieba` feature)

To python/Cargo.toml, add:

lance-index = { path = "../rust/lance-index", features = ["tantivy-jieba"] }

SaintBacchus (Collaborator, Author):

> Seems reasonable. We can't really make this an optional dependency in Python (since we ship binaries), but it would be nice if we could in Rust.
>
> Also, could you report the effect on binary size for the python wheels?

Thanks for your suggestion. I tested the Python wheel size in release mode:

Jieba: 35M
Main:  32M

chenkovsky (Contributor) commented Dec 8, 2024:

> Seems reasonable. We can't really make this an optional dependency in Python (since we ship binaries), but it would be nice if we could in Rust.
>
> Also, could you report the effect on binary size for the python wheels?

@wjones127 @SaintBacchus The Chinese tokenizer contains an n-gram language model, but we have to customize the language model for different scenarios. Can we have a mechanism that loads the language model dynamically? For example, get the language model path from an env var; then we could exclude the language model from the wheel. I created a PR to illustrate my idea: #3218
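
A rough sketch of the kind of mechanism meant here, using jieba-rs directly (the env var name LANCE_JIEBA_DICT is hypothetical, and #3218 may implement this differently):

use std::env;
use std::fs::File;
use std::io::BufReader;

use jieba_rs::Jieba;

// Sketch only: load a custom dictionary from a runtime-supplied path,
// so the wheel does not need to bundle every language model.
fn load_jieba() -> Jieba {
    match env::var("LANCE_JIEBA_DICT") {
        Ok(path) => {
            let file = File::open(&path).expect("dictionary file should exist");
            let mut reader = BufReader::new(file);
            // Jieba::with_dict builds a tokenizer from a custom dictionary.
            Jieba::with_dict(&mut reader).expect("dictionary should parse")
        }
        // No override set: fall back to the dictionary bundled with jieba-rs.
        Err(_) => Jieba::new(),
    }
}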

wjones127 (Contributor):

> @wjones127 @SaintBacchus The Chinese tokenizer contains an n-gram language model, but we have to customize the language model for different scenarios. Can we have a mechanism that loads the language model dynamically? For example, get the language model path from an env var; then we could exclude the language model from the wheel. I created a PR to illustrate my idea: #3218

I'm not loving the additional code size added by each tokenizer. I do agree a plugin system would be better.

Here's my thinking:

  • For now, we put these special tokenizers under feature flags. Users who want Chinese or Korean tokenization would unfortunately need to build pylance themselves. This isn't too hard, though, thanks to maturin.
  • We should look at a tokenizer plugin API. If we can, we should ideally share a mechanism with tantivy. I created an issue to track this: Tokenizer plugins #3222

SaintBacchus (Collaborator, Author):

Since #3218 has been merged, the jieba tokenizer is already provided. This PR should be closed.
