Tantivy analysis

This is a collection of tokenizers and token filters for Tantivy that aims to replicate features available in Lucene.

It relies on Google's Rust ICU; libicu-dev and clang must be installed in order to compile it.

The word-breaking rules are taken from Lucene.

Features

The icu feature includes the following components (each is also a feature of its own):
- ICUTokenizer
- ICUNormalizer2TokenFilter
- ICUTransformTokenFilter

The commons feature includes the following components:
- LengthTokenFilter
- LimitTokenCountFilter
- PathTokenizer
- ReverseTokenFilter
- ElisionTokenFilter
- EdgeNgramTokenFilter

The phonetic feature provides several phonetic algorithms (Beider-Morse, Soundex, Metaphone, ...; see the crate documentation) through one component:
- PhoneticTokenFilter

The embedded feature enables the rules embedded in the rphonetic crate. It is not included by default. It has two sub-features: embedded-bm, which enables only the embedded Beider-Morse rules, and embedded-dm, which enables only the Daitch-Mokotoff rules.

Note that phonetic support probably needs improvements.
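To make the idea behind a phonetic filter concrete, here is a minimal, standalone sketch of classic American Soundex in plain Rust. It is an illustration only: the soundex and code functions are hypothetical names, not part of this crate or of rphonetic, and the algorithms behind PhoneticTokenFilter may differ in their details.

```rust
/// Soundex digit for a consonant; `None` for vowels and other letters.
fn code(c: char) -> Option<char> {
    match c.to_ascii_uppercase() {
        'B' | 'F' | 'P' | 'V' => Some('1'),
        'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => Some('2'),
        'D' | 'T' => Some('3'),
        'L' => Some('4'),
        'M' | 'N' => Some('5'),
        'R' => Some('6'),
        _ => None,
    }
}

/// Illustrative American Soundex: first letter plus three digits.
/// Returns `None` when the input contains no ASCII letter.
fn soundex(word: &str) -> Option<String> {
    let mut letters = word.chars().filter(|c| c.is_ascii_alphabetic());
    let first = letters.next()?.to_ascii_uppercase();

    let mut out = String::new();
    out.push(first);
    // The code of the first letter also takes part in duplicate removal.
    let mut last = code(first);

    for c in letters {
        let upper = c.to_ascii_uppercase();
        // H and W are skipped and do not separate identical codes.
        if upper == 'H' || upper == 'W' {
            continue;
        }
        let current = code(upper);
        if let Some(digit) = current {
            if current != last {
                out.push(digit);
                if out.len() == 4 {
                    break;
                }
            }
        }
        // Vowels reset `last`, so a repeated code after a vowel is kept.
        last = current;
    }

    // Pad short codes with zeros up to four characters.
    while out.len() < 4 {
        out.push('0');
    }
    Some(out)
}

fn main() {
    for name in ["Robert", "Rupert", "Ashcraft"] {
        println!("{} -> {:?}", name, soundex(name));
    }
}
```

In this sketch "Robert" and "Rupert" both encode to R163. Indexing such codes instead of the raw tokens is what lets similar-sounding words match each other.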

By default, icu, commons and phonetic are included.
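As a rough picture of what two of the commons components do to a single token, the following standalone sketch reimplements edge n-grams and elision stripping in plain Rust. It assumes the usual Lucene meaning of both terms. The functions edge_ngrams and strip_elision are hypothetical helpers written for this illustration and are not the API of EdgeNgramTokenFilter or ElisionTokenFilter.

```rust
/// Prefixes of `token` whose length, counted in characters, lies in `min..=max`.
fn edge_ngrams(token: &str, min: usize, max: usize) -> Vec<String> {
    // Work on characters so multi-byte text is never split mid-character.
    let chars: Vec<char> = token.chars().collect();
    let upper = max.min(chars.len());
    (min..=upper)
        .filter(|&n| n > 0)
        .map(|n| chars[..n].iter().collect())
        .collect()
}

/// Removes a leading elided article such as "l'" when the part before the
/// first ASCII apostrophe is one of `articles` (ASCII case-insensitive).
fn strip_elision<'a>(token: &'a str, articles: &[&str]) -> &'a str {
    if let Some((prefix, rest)) = token.split_once('\'') {
        if articles.iter().any(|a| a.eq_ignore_ascii_case(prefix)) {
            return rest;
        }
    }
    token
}

fn main() {
    // prints ["ta", "tan", "tant"]
    println!("{:?}", edge_ngrams("tantivy", 2, 4));
    // prints avion
    println!("{}", strip_elision("l'avion", &["l", "d", "j"]));
}
```

Edge n-grams are typically used for prefix or search-as-you-type matching, while elision stripping lets a query such as "avion" match text containing "l'avion".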

Example

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{IndexRecordOption, SchemaBuilder, TextFieldIndexing, TextOptions, Value};
use tantivy::tokenizer::TextAnalyzer;
use tantivy::{doc, Index, ReloadPolicy, TantivyDocument};
use tantivy_analysis_contrib::icu::{Direction, ICUTokenizer, ICUTransformTokenFilter};

const ANALYSIS_NAME: &str = "test";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer(ANALYSIS_NAME)
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    let mut schema = SchemaBuilder::new();
    schema.add_text_field("field", options);
    let schema = schema.build();

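    // Transliterate each token to Latin, strip diacritics and lowercase it,
    // so that "中国" can be found with the query "zhong".
    let transform = ICUTransformTokenFilter::new(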
        "Any-Latin; NFD; [:Nonspacing Mark:] Remove; Lower;  NFC".to_string(),
        None,
        Direction::Forward,
    )?;
    let icu_analyzer = TextAnalyzer::builder(ICUTokenizer)
        .filter(transform)
        .build();

    let field = schema.get_field("field").expect("Can't get field.");

    let index = Index::create_in_ram(schema);
    index.tokenizers().register(ANALYSIS_NAME, icu_analyzer);

    let mut index_writer = index.writer(15_000_000)?;

    index_writer.add_document(doc!(
        field => "中国"
    ))?;
    index_writer.add_document(doc!(
        field => "Another Document"
    ))?;

    index_writer.commit()?;

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::Manual)
        .try_into()?;

    let searcher = reader.searcher();

    let query_parser = QueryParser::for_index(&index, vec![field]);

    // A transliterated query matches the Chinese document.
    let query = query_parser.parse_query("zhong")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    // Query terms go through the same analyzer, so the original script matches too.
    let query = query_parser.parse_query("国")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["中国".to_string()];
    assert_eq!(expected, result);

    // The transform lowercases tokens, so matching is case-insensitive.
    let query = query_parser.parse_query("document")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut result: Vec<String> = Vec::new();
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
        result = retrieved_doc
            .get_all(field)
            .map(|v| v.as_str().unwrap().to_string())
            .collect();
    }
    let expected: Vec<String> = vec!["Another Document".to_string()];
    assert_eq!(expected, result);
    Ok(())
}

License

Licensed under either of

- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dalvany/tantivy-analysis-contrib

Folders and files

Latest commit

History

Repository files navigation

Tantivy analysis

Features

Example

License

Contribution

About

Topics

Resources

License

Licenses found

Stars

Watchers

Forks

Releases 15

Packages 0

Contributors 4

Languages

Packages