doc: update README and doc comment
Gowee committed Jul 1, 2023
1 parent e9776f4 commit a899a76
Showing 2 changed files with 14 additions and 6 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -6,7 +6,7 @@
# zhconv-rs 中文简繁及地區詞轉換
zhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. `zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant`), built on the top of rulesets from MediaWiki/Wikipedia and OpenCC.

-The implementation is powered by the [Aho-Corasick](https://github.com/BurntSushi/aho-corasick) automaton, ensuring linear time complexity with respect to the length of input text and conversion rules (`O(n+m)`), processing dozens of MiBs text per second.
+The implementation is powered by an [Aho-Corasick](https://github.com/daac-tools/daachorse) automaton, ensuring linear time complexity with respect to the length of input text and conversion rules (`O(n+m)`), processing dozens of MiBs text per second.

🔗 **Web App: https://zhconv.pages.dev** (powered by WASM)

@@ -120,7 +120,7 @@ The benchmark was performed on a previous version that had only Mediawiki rulese
* OpenCC: The [conversion rulesets](https://github.com/BYVoid/OpenCC/tree/master/data/dictionary) of OpenCC is independent of MediaWiki. The core [conversion implementation](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Conversion.cpp#L27) of OpenCC is kinda similar to the aforementioned `strtr`. However, OpenCC supports pre-segmentation and maintains multiple rulesets which are applied successively. By contrast, the Aho-Corasick-powered zhconv-rs merges rulesets from MediaWiki and OpenCC in compile time and converts text in single-pass linear time, resulting in much more efficiency. Though, conversion results may differ in some cases.

## Limitations
-The converter is based on an aho-corasick automaton with the leftmost-longest matching strategy. This strategy gives priority to the leftmost-matched words or phrases. For instance, if a ruleset includes both `干 -> 幹` and `天干物燥 -> 天乾物燥`, the converter would prioritize `天乾物燥` because `天干物燥` gets matched earlier compared to `干` at a later position. The strategy yields good results in general, but may occasionally lead to wrong conversions.
+The converter takes leftmost-longest matching strategy. It gives priority to the leftmost-matched words or phrases. For instance, if a ruleset includes both `干 -> 幹` and `天干物燥 -> 天乾物燥`, the converter would prioritize `天乾物燥` because `天干物燥` gets matched earlier compared to `干` at a later position. The strategy yields good results in general, but may occasionally lead to wrong conversions.

The implementation support most of the MediaWiki conversion rules. But it is not fully compliant with the original implementation.
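
As a rough illustration of the leftmost-longest strategy described above, here is a minimal sketch using only the standard library. It is a naive prefix scan, not the Aho-Corasick automaton the crate uses, and the function name `convert_leftmost_longest` is hypothetical rather than part of zhconv-rs. It only shows why `天干物燥` wins over `干`: the longer rule already matches at an earlier position, so the shorter rule never gets a chance inside it.

```rust
/// Naive leftmost-longest replacement (illustrative only, not the crate's code).
/// Scans the text from left to right; at each position it applies the rule
/// with the longest matching source, or copies one character if none matches.
fn convert_leftmost_longest(text: &str, rules: &[(&str, &str)]) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while !rest.is_empty() {
        // Among the rules whose source is a prefix of `rest`, pick the longest.
        let mut best: Option<(&str, &str)> = None;
        for &(from, to) in rules {
            if !from.is_empty() && rest.starts_with(from) {
                match best {
                    Some((prev, _)) if prev.len() >= from.len() => {}
                    _ => best = Some((from, to)),
                }
            }
        }
        match best {
            Some((from, to)) => {
                // A rule matched here: emit its target and skip the source.
                out.push_str(to);
                rest = &rest[from.len()..];
            }
            None => {
                // No rule matched: copy a single character unchanged.
                let ch = rest.chars().next().unwrap();
                out.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}
```

With the two rules from the paragraph above, calling this sketch on `天干物燥` applies the four-character rule as a whole, while a standalone `干` still goes through the single-character rule. This scan costs time proportional to the number of rules at every position, which is the overhead an automaton avoids.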

16 changes: 12 additions & 4 deletions src/lib.rs
@@ -1,5 +1,5 @@
//! This crate provides a ZhConverter that converts Chinese variants among each other. The
-//! implementation is based on the [Aho-Corasick](https://docs.rs/aho-corasick/latest) automaton
+//! implementation is based on the [Aho-Corasick](https://docs.rs/daachorse) algorithm
//! with the leftmost-longest matching strategy and linear time complexity with respect to the
//! length of input text and conversion rules. It ships with a bunch of conversion tables,
//! extracted from [zhConversion.php](https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/languages/data/ZhConversion.php)
@@ -27,7 +27,7 @@
//! Basic conversion:
//! ```
//! use zhconv::{zhconv, Variant};
-//! assert_eq!(zhconv("天干物燥 小心火烛", Variant::ZhHant), "天乾物燥 小心火燭");
+//! assert_eq!(zhconv("天干物燥 小心火烛", "zh-Hant".parse().unwrap()), "天乾物燥 小心火燭");
//! assert_eq!(zhconv("鼠曲草", Variant::ZhHant), "鼠麴草");
//! assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhHant), "阿拉伯聯合酋長國");
//! assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhTW), "阿拉伯聯合大公國");
@@ -45,6 +45,14 @@
//! To load or add additional conversion rules such as CGroups or `(FROM, TO)` pairs,
//! see [`ZhConverterBuilder`].
//!
+//! Other useful function:
+//! ```
+//! use zhconv::{is_hans, is_hans_confidence, infer_variant, infer_variant_confidence};
+//! assert!(!is_hans("秋冬濁而春夏清,晞於朝而生於夕"));
+//! assert!(is_hans_confidence("滴瀝明花苑,葳蕤泫竹叢") < 0.5);
+//! println!("{}", infer_variant("錦字緘愁過薊水,寒衣將淚到遼城"));
+//! println!("{:?}", infer_variant_confidence("zhconv-rs 中文简繁及地區詞轉換"));
+//! ```

mod converter;
mod utils;
@@ -89,8 +97,8 @@ pub fn zhconv(text: &str, target: Variant) -> String {
/// `n` is input text length and `m` is the maximum lengths of source words in conversion rulesets.
///
/// In case global rules support are not expected, it is better to use
-/// `get_builtin_converter(target).convert_as_wikitext_basic(text)` instead, which runs in O(n)
-/// in general.
+/// `get_builtin_converter(target).convert_as_wikitext_basic(text)` instead, which incurs no extra
+/// overhead.
///
// /// Different from the implementation of MediaWiki, this crate use a automaton which makes it
// /// infeasible to mutate global rules during converting. So the function always searches the text
