
feat: support jieba tokenizer which is a popular chinese tokenizer #3205

Closed

Conversation

SaintBacchus (Collaborator):

Add the jieba tokenizer to Lance for Chinese sentences.

github-actions bot added the enhancement (New feature or request) and python labels on Dec 5, 2024
@codecov-commenter

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Project coverage is 78.66%. Comparing base (6e84834) to head (273d590).

Files with missing lines                            Patch %   Lines
rust/lance-index/src/scalar/inverted/tokenizer.rs   0.00%     3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3205      +/-   ##
==========================================
+ Coverage   78.62%   78.66%   +0.04%     
==========================================
  Files         243      243              
  Lines       82889    82892       +3     
==========================================
+ Hits        65170    65206      +36     
+ Misses      14933    14902      -31     
+ Partials     2786     2784       -2     
Flag        Coverage Δ
unittests   78.66% <0.00%> (+0.04%) ⬆️


wjones127 (Contributor) left a comment:

Seems reasonable. We can't really make this an optional dependency in Python (since we ship binaries), but it would be nice if we could in Rust.

Also, could you report the effect on binary size for the python wheels?

@@ -50,6 +50,7 @@ serde_json.workspace = true
serde.workspace = true
snafu.workspace = true
tantivy.workspace = true
tantivy-jieba.workspace = true
Contributor:
Could we make this an optional dependency for lance-index?

SaintBacchus (Collaborator, Author):
@wjones127 Is there an example of an optional dependency in another Rust lib? I'm not quite familiar with how to achieve it.

Contributor:
https://doc.rust-lang.org/cargo/reference/features.html#optional-dependencies

Suggested change:
- tantivy-jieba.workspace = true
+ tantivy-jieba = { workspace = true, optional = true }
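
For context: marking a dependency optional automatically creates an implicit Cargo feature with the same name, and a crate can also declare the feature explicitly. A minimal sketch of the explicit form for rust/lance-index/Cargo.toml (illustrative only, not necessarily what was merged):

# Illustrative sketch: an explicit [features] entry so that building
# with --features tantivy-jieba pulls in the optional dependency.
# The optional dependency alone already provides an implicit feature
# of the same name, so this entry is a stylistic choice.
[features]
tantivy-jieba = ["dep:tantivy-jieba"]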

wjones127 (Contributor) left a comment:

Added some suggestions demonstrating how to make tantivy-jieba an optional dependency.

Comment on lines +145 to +147
"jieba" => Ok(
tantivy::tokenizer::TextAnalyzer::builder(tantivy_jieba::JiebaTokenizer {}).dynamic(),
),
Suggested change:
+ #[cfg(feature = "tantivy-jieba")]
  "jieba" => Ok(
      tantivy::tokenizer::TextAnalyzer::builder(tantivy_jieba::JiebaTokenizer {}).dynamic(),
  ),
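
With the arm feature-gated, a build without the feature would let "jieba" fall through to the generic unknown-tokenizer error. A self-contained sketch (illustrative names and types, not the actual lance-index code) of how a lookup can report the missing feature explicitly:

// Illustrative sketch only; lance-index's real return and error types
// differ. cfg attributes on match arms compile in exactly one of the
// two "jieba" arms, depending on whether the feature is enabled.
fn build_tokenizer(name: &str) -> Result<String, String> {
    match name {
        #[cfg(feature = "tantivy-jieba")]
        "jieba" => Ok("jieba analyzer".to_string()),
        #[cfg(not(feature = "tantivy-jieba"))]
        "jieba" => Err(
            "the 'jieba' tokenizer requires building with the 'tantivy-jieba' feature"
                .to_string(),
        ),
        other => Err(format!("unknown tokenizer: {other}")),
    }
}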

@@ -12,6 +12,7 @@ pub struct TokenizerConfig {
/// - `simple`: splits tokens on whitespace and punctuation
/// - `whitespace`: splits tokens on whitespace
/// - `raw`: no tokenization
/// - `jieba`: a popular chinese tokenization
Suggested change:
- /// - `jieba`: a popular chinese tokenization
+ /// - `jieba`: a popular chinese tokenization (enabled by `tantivy-jieba` feature)

To python/Cargo.toml, add:

lance-index = { path = "../rust/lance-index", features = ["tantivy-jieba"] }

SaintBacchus (Collaborator, Author):

> Seems reasonable. We can't really make this an optional dependency in Python (since we ship binaries), but it would be nice if we could in Rust.
>
> Also, could you report the effect on binary size for the python wheels?

Thanks for your suggestion. I tested the Python wheel size in release mode:

Jieba: 35M
Main:  32M

chenkovsky (Contributor) commented Dec 8, 2024:

> Seems reasonable. We can't really make this an optional dependency in Python (since we ship binaries), but it would be nice if we could in Rust.
>
> Also, could you report the effect on binary size for the python wheels?

@wjones127 @SaintBacchus The Chinese tokenizer contains an n-gram language model, but we have to customize the language model for different scenarios. Can we have a mechanism that loads the language model dynamically? For example, get the language model path from an env var; then we could exclude the language model from the wheel. I created a PR to illustrate my idea: #3218
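
A rough sketch of the kind of mechanism meant here, using jieba-rs directly (the env var name LANCE_JIEBA_DICT is hypothetical, and #3218 may implement this differently):

use std::env;
use std::fs::File;
use std::io::BufReader;

use jieba_rs::Jieba;

// Sketch only: load a custom dictionary from a runtime-supplied path,
// so the wheel does not need to bundle every language model.
fn load_jieba() -> Jieba {
    match env::var("LANCE_JIEBA_DICT") {
        Ok(path) => {
            let file = File::open(&path).expect("dictionary file should exist");
            let mut reader = BufReader::new(file);
            // Jieba::with_dict builds a tokenizer from a custom dictionary.
            Jieba::with_dict(&mut reader).expect("dictionary should parse")
        }
        // No override set: fall back to the dictionary bundled with jieba-rs.
        Err(_) => Jieba::new(),
    }
}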

wjones127 (Contributor):

> @wjones127 @SaintBacchus The Chinese tokenizer contains an n-gram language model, but we have to customize the language model for different scenarios. Can we have a mechanism that loads the language model dynamically? For example, get the language model path from an env var; then we could exclude the language model from the wheel. I created a PR to illustrate my idea: #3218

I'm not loving the additional code size added by each tokenizer. I do agree a plugin system would be better.

Here's my thinking:

  • For now, we put these special tokenizers under feature flags. Users who want Chinese or Korean tokenization would unfortunately need to build pylance themselves. This isn't too hard, though, thanks to maturin.
  • We should look at a tokenizer plugin API. If we can, we should ideally share a mechanism with tantivy. I created an issue to track this: Tokenizer plugins #3222

SaintBacchus (Collaborator, Author):

Since #3218 has been merged, the jieba tokenizer is already provided. This PR should be closed.
