Merge branch 'main' into CJK-bigram-token-filter-page
kolchfa-aws authored Sep 13, 2024
2 parents 9943b38 + 41b1b06 commit fe64076
Showing 13 changed files with 138 additions and 11 deletions.
96 changes: 96 additions & 0 deletions _analyzers/token-filters/cjk-width.md
@@ -0,0 +1,96 @@
---
layout: default
title: CJK width
parent: Token filters
nav_order: 40
---

# CJK width token filter

The `cjk_width` token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents.

### Converting full-width ASCII characters

In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, occupying the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography for alignment with the width of CJK characters. However, for the purposes of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents.

The following example illustrates ASCII character normalization:

```
Full-Width:              ＡＢＣＤＥ １２３４５
Normalized (half-width): ABCDE 12345
```
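
The folding described in this section can be approximated outside of the analyzer. The following Python sketch is an illustration only, not the Lucene implementation; the function name `fold_fullwidth_ascii` is hypothetical. It relies on the fact that the full-width ASCII variants occupy U+FF01 through U+FF5E, at a fixed offset of `0xFEE0` from their basic Latin counterparts:

```python
# Illustrative sketch only: approximates the full-width ASCII folding
# described above. This is not the code used by the cjk_width filter.

FULLWIDTH_START = 0xFF01  # FULLWIDTH EXCLAMATION MARK
FULLWIDTH_END = 0xFF5E    # FULLWIDTH TILDE
OFFSET = 0xFEE0           # distance between U+FF01 and U+0021


def fold_fullwidth_ascii(text: str) -> str:
    """Map full-width ASCII variants to their basic Latin equivalents."""
    folded = []
    for ch in text:
        code_point = ord(ch)
        if FULLWIDTH_START <= code_point <= FULLWIDTH_END:
            folded.append(chr(code_point - OFFSET))
        else:
            # Leave every other character (including CJK text) unchanged.
            folded.append(ch)
    return "".join(folded)


print(fold_fullwidth_ascii("ＡＢＣＤＥ １２３４５"))  # ABCDE 12345
```

Characters outside the full-width ASCII range pass through untouched, so the function can be applied to mixed CJK and Latin text.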

### Converting half-width katakana characters

The `cjk_width` token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization, illustrated in the following example, is important for consistency in text processing and searching:


```
Half-Width katakana:              ｶﾀｶﾅ
Normalized (full-width) katakana: カタカナ
```
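
A similar effect can be reproduced in Python for experimentation. The sketch below is an approximation under stated assumptions, not the filter's own logic: it applies Unicode NFKC normalization from the standard `unicodedata` module, restricted to runs of half-width katakana (U+FF65 through U+FF9F). The function name `widen_halfwidth_katakana` is hypothetical.

```python
# Illustrative sketch only: uses NFKC normalization as a stand-in for the
# half-width katakana folding described above.
import re
import unicodedata

# Runs of characters in the half-width katakana range (U+FF65..U+FF9F).
_HALFWIDTH_KATAKANA_RUN = re.compile("[\uff65-\uff9f]+")


def widen_halfwidth_katakana(text: str) -> str:
    """Replace half-width katakana runs with their full-width forms."""
    return _HALFWIDTH_KATAKANA_RUN.sub(
        lambda match: unicodedata.normalize("NFKC", match.group()),
        text,
    )


print(widen_halfwidth_katakana("ｶﾀｶﾅ"))  # カタカナ
```

Normalizing whole runs rather than single characters lets a half-width voiced sound mark combine with the preceding kana, so that `ｶﾞ` becomes `ガ`. Restricting the substitution to the half-width katakana range leaves all other text as is.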

## Example

The following example request creates a new index named `cjk_width_example_index` and defines an analyzer with the `cjk_width` filter:

```json
PUT /cjk_width_example_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "cjk_width_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["cjk_width"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /cjk_width_example_index/_analyze
{
  "analyzer": "cjk_width_analyzer",
  "text": "Ｔｏｋｙｏ ２０２４ ｶﾀｶﾅ"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "Tokyo",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2024",
      "start_offset": 6,
      "end_offset": 10,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "カタカナ",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<KATAKANA>",
      "position": 2
    }
  ]
}
```
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -16,7 +16,7 @@ Token filter | Underlying Lucene token filter| Description
[`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
-`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
+[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into their equivalent basic Latin characters. <br> - Folds half-width katakana character variants into their equivalent kana characters.
`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script.
@@ -3,7 +3,7 @@ layout: default
title: documentdb
parent: Sources
grand_parent: Pipelines
-nav_order: 2
+nav_order: 10
---

# documentdb
2 changes: 1 addition & 1 deletion _data-prepper/pipelines/configuration/sources/dynamo-db.md
@@ -3,7 +3,7 @@ layout: default
title: dynamodb
parent: Sources
grand_parent: Pipelines
-nav_order: 3
+nav_order: 20
---

# dynamodb
2 changes: 1 addition & 1 deletion _data-prepper/pipelines/configuration/sources/http.md
@@ -3,7 +3,7 @@ layout: default
title: http
parent: Sources
grand_parent: Pipelines
-nav_order: 5
+nav_order: 30
redirect_from:
- /data-prepper/pipelines/configuration/sources/http-source/
---
2 changes: 1 addition & 1 deletion _data-prepper/pipelines/configuration/sources/kafka.md
@@ -3,7 +3,7 @@ layout: default
title: kafka
parent: Sources
grand_parent: Pipelines
-nav_order: 6
+nav_order: 40
---

# kafka
@@ -3,7 +3,7 @@ layout: default
title: opensearch
parent: Sources
grand_parent: Pipelines
-nav_order: 30
+nav_order: 50
---

# opensearch
@@ -3,7 +3,7 @@ layout: default
title: otel_logs_source
parent: Sources
grand_parent: Pipelines
-nav_order: 25
+nav_order: 60
---

# otel_logs_source
@@ -3,7 +3,7 @@ layout: default
title: otel_metrics_source
parent: Sources
grand_parent: Pipelines
-nav_order: 10
+nav_order: 70
---

# otel_metrics_source
@@ -3,7 +3,7 @@ layout: default
title: otel_trace_source
parent: Sources
grand_parent: Pipelines
-nav_order: 15
+nav_order: 80
redirect_from:
- /data-prepper/pipelines/configuration/sources/otel-trace/
---
31 changes: 31 additions & 0 deletions _data-prepper/pipelines/configuration/sources/pipeline.md
@@ -0,0 +1,31 @@
---
layout: default
title: pipeline
parent: Sources
grand_parent: Pipelines
nav_order: 90
---

# pipeline

Use the `pipeline` source to read from another pipeline.

## Configuration options

The `pipeline` source supports the following configuration options.

| Option | Required | Type | Description |
|:-------|:---------|:-------|:---------------------------------------|
| `name` | Yes | String | The name of the pipeline to read from. |

## Usage

The following example configures a pipeline named `sample-pipeline` whose `pipeline` source reads from a pipeline named `movies`:

```yaml
sample-pipeline:
  source:
    - pipeline:
        name: "movies"
```
{% include copy.html %}
2 changes: 1 addition & 1 deletion _data-prepper/pipelines/configuration/sources/s3.md
@@ -3,7 +3,7 @@ layout: default
title: s3 source
parent: Sources
grand_parent: Pipelines
-nav_order: 20
+nav_order: 100
---

# s3 source
2 changes: 1 addition & 1 deletion _data-prepper/pipelines/configuration/sources/sources.md
@@ -3,7 +3,7 @@ layout: default
title: Sources
parent: Pipelines
has_children: true
-nav_order: 20
+nav_order: 110
---

# Sources