From 41b1b069f1ac51540743e60f48bc080c91017ce7 Mon Sep 17 00:00:00 2001
From: AntonEliatra
Date: Fri, 13 Sep 2024 14:24:26 +0100
Subject: [PATCH] Add Cjk width token filter (#7917)

* adding token filter page for cjk width #7875

Signed-off-by: AntonEliatra

* adding details to the page

Signed-off-by: AntonEliatra

* adding details to the page

Signed-off-by: AntonEliatra

* Updating details as per comments

Signed-off-by: AntonEliatra

* Update cjk-width.md

Signed-off-by: AntonEliatra

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: AntonEliatra

* Update cjk-width.md

Signed-off-by: AntonEliatra

* Update cjk-width.md

Signed-off-by: AntonEliatra

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: AntonEliatra

---------

Signed-off-by: AntonEliatra
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower
---
 _analyzers/token-filters/cjk-width.md | 96 +++++++++++++++++++++++++++
 _analyzers/token-filters/index.md     |  2 +-
 2 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 _analyzers/token-filters/cjk-width.md

diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md
new file mode 100644
index 0000000000..4960729cd1
--- /dev/null
+++ b/_analyzers/token-filters/cjk-width.md
@@ -0,0 +1,96 @@
+---
+layout: default
+title: CJK width
+parent: Token filters
+nav_order: 40
+---
+
+# CJK width token filter
+
+The `cjk_width` token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents.
+
+### Converting full-width ASCII characters
+
+In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, occupying the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography for alignment with the width of CJK characters. However, for the purposes of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents.
+
+The following example illustrates ASCII character normalization:
+
+```
+    Full-Width:              ＡＢＣＤＥ １２３４５
+    Normalized (half-width): ABCDE 12345
+```
+
+### Converting half-width katakana characters
+
+The `cjk_width` token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization, illustrated in the following example, is important for consistency in text processing and searching:
+
+
+```
+    Half-Width katakana:              ｶﾀｶﾅ
+    Normalized (full-width) katakana: カタカナ
+```
+
+## Example
+
+The following example request creates a new index named `cjk_width_example_index` and defines an analyzer with the `cjk_width` filter:
+
+```json
+PUT /cjk_width_example_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "cjk_width_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["cjk_width"]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /cjk_width_example_index/_analyze
+{
+  "analyzer": "cjk_width_analyzer",
+  "text": "Ｔｏｋｙｏ ２０２４ ｶﾀｶﾅ"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "Tokyo",
+      "start_offset": 0,
+      "end_offset": 5,
+      "type": "<ALPHANUM>",
+      "position": 0
+    },
+    {
+      "token": "2024",
+      "start_offset": 6,
+      "end_offset": 10,
+      "type": "<NUM>",
+      "position": 1
+    },
+    {
+      "token": "カタカナ",
+      "start_offset": 11,
+      "end_offset": 15,
+      "type": "<KATAKANA>",
+      "position": 2
+    }
+  ]
+}
+```
diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
index a9b621d5ab..86925123b8 100644
--- a/_analyzers/token-filters/index.md
+++ b/_analyzers/token-filters/index.md
@@ -16,7 +16,7 @@ Token filter | Underlying Lucene token filter| Description
 [`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
 [`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
 `cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
-`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
+[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into their equivalent basic Latin characters. <br> - Folds half-width katakana character variants into their equivalent kana characters.
 `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
 `common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
 `conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script.