Merge branch 'main' into api-template-reconcillation

Naarcha-AWS authored Oct 9, 2024
2 parents cfa70b6 + cd31d82 commit 273267e
Showing 4 changed files with 98 additions and 5 deletions.
6 changes: 3 additions & 3 deletions _analyzers/language-analyzers.md
@@ -7,9 +7,9 @@ redirect_from:
- /query-dsl/analyzers/language-analyzers/
---

-# Language analyzer
+# Language analyzers

-OpenSearch supports the following language values with the `analyzer` option:
+OpenSearch supports the following language analyzers:
`arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `english`, `estonian`, `finnish`, `french`, `galician`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `latvian`, `lithuanian`, `norwegian`, `persian`, `portuguese`, `romanian`, `russian`, `sorani`, `spanish`, `swedish`, `turkish`, and `thai`.

To use the analyzer when you map an index, specify the value within your query. For example, to map your index with the French language analyzer, specify the `french` value for the analyzer field:
@@ -41,4 +41,4 @@ PUT my-index
}
```
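For reference, a complete version of such a mapping might look like the following sketch. The `content` field name is an assumption for illustration; any `text` field can take the `analyzer` setting:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "french"
      }
    }
  }
}
```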

-<!-- TO do: each of the options needs its own section with an example. Convert table to individual sections, and then give a streamlined list with valid values. -->
+<!-- TO do: each of the options needs its own section with an example. Convert table to individual sections, and then give a streamlined list with valid values. -->
93 changes: 93 additions & 0 deletions _analyzers/token-filters/classic.md
@@ -0,0 +1,93 @@
---
layout: default
title: Classic
parent: Token filters
nav_order: 50
---

# Classic token filter

The primary function of the classic token filter is to work alongside the classic tokenizer. It processes tokens by applying the following common transformations, which aid in text analysis and search:
- Removal of possessive endings such as *'s*. For example, *John's* becomes *John*.
- Removal of periods from acronyms. For example, *D.A.R.P.A.* becomes *DARPA*.


## Example

The following example request creates a new index named `custom_classic_filter` and configures an analyzer with the `classic` filter:

```json
PUT /custom_classic_filter
{
"settings": {
"analysis": {
"analyzer": {
"custom_classic": {
"type": "custom",
"tokenizer": "classic",
"filter": ["classic"]
}
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /custom_classic_filter/_analyze
{
"analyzer": "custom_classic",
"text": "John's co-operate was excellent."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "John",
"start_offset": 0,
"end_offset": 6,
"type": "<APOSTROPHE>",
"position": 0
},
{
"token": "co",
"start_offset": 7,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "operate",
"start_offset": 10,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "was",
"start_offset": 18,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "excellent",
"start_offset": 22,
"end_offset": 31,
"type": "<ALPHANUM>",
"position": 4
}
]
}
```

2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -19,7 +19,7 @@ Token filter | Underlying Lucene token filter | Description
[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into their equivalent basic Latin characters. <br> - Folds half-width katakana character variants into their equivalent kana characters.
-`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
+[`classic`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/classic) | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script.
`decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9).
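As a quick illustration of one entry in the table above, a token filter can be tried directly with the `_analyze` API without creating an index. This sketch runs the `decimal_digit` filter on Arabic-Indic digits, which it should fold to basic Latin digits:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["decimal_digit"],
  "text": "٢٠٢٤"
}
```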
2 changes: 1 addition & 1 deletion _dashboards/visualize/viz-index.md
@@ -81,7 +81,7 @@ Region maps show patterns and trends across geographic locations. A region map i

### Markdown

-Markdown is a the markup language used in Dashboards to provide context to your data visualizations. Using Markdown, you can display information and instructions along with the visualization.
+Markdown is the markup language used in Dashboards to provide context to your data visualizations. Using Markdown, you can display information and instructions along with the visualization.

<img src="{{site.url}}{{site.baseurl}}/images/dashboards/markdown.png" width="600" height="600" alt="Example coordinate map in OpenSearch Dashboards">
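A short sketch of the kind of Markdown such a panel might contain (the contents below are illustrative, not part of any shipped dashboard):

```markdown
## Weekly sales notes

This panel summarizes **weekly revenue** for the `sales-*` index pattern.

- Data refreshes every hour.
- Contact the analytics team with questions.
```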

