Commit

Merge branch 'main' into linear-regression
kolchfa-aws authored Jan 9, 2025
2 parents f45547a + e7e36b5 commit e8682ee
Showing 21 changed files with 1,308 additions and 66 deletions.
80 changes: 67 additions & 13 deletions _analyzers/character-filters/html-character-filter.md
@@ -9,7 +9,9 @@ nav_order: 100

The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and renders plain text. The filter can be configured to preserve certain tags or decode specific HTML entities, such as `&nbsp;`, into spaces.

## Example: HTML analyzer
## Example

The following request applies an `html_strip` character filter to the provided text:

```json
GET /_analyze
@@ -23,15 +25,35 @@ GET /_analyze
```
{% include copy-curl.html %}

Using the HTML analyzer, you can convert the HTML character entity references into their corresponding symbols. The processed text would read as follows:
The response contains a token in which the HTML character entities have been converted to their decoded values:

```
```json
{
"tokens": [
{
"token": """
Commonly used calculus symbols include α, β and θ
""",
"start_offset": 0,
"end_offset": 74,
"type": "word",
"position": 0
}
]
}
```

## Parameters

The `html_strip` character filter can be configured with the following parameter.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `escaped_tags` | Optional | Array of strings | An array of HTML element names, specified without the enclosing angle brackets (`< >`). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to `["b", "i"]` will prevent the `<b>` and `<i>` elements from being stripped.|
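
You can also try this parameter ad hoc with the `_analyze` API. The following minimal sketch (the sample text is illustrative) strips `<p>` tags while preserving `<b>`:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": ["b"]
    }
  ],
  "text": "<p>This is <b>bold</b> text.</p>"
}
```

In the resulting token, the `<p>` element is removed, but the `<b>` element remains in the text.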

## Example: Custom analyzer with lowercase filter

The following example query creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` analyzer and `lowercase` filter:
The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` analyzer and `lowercase` filter:

```json
PUT /html_strip_and_lowercase_analyzer
@@ -57,9 +79,7 @@ PUT /html_strip_and_lowercase_analyzer
```
{% include copy-curl.html %}

### Testing `html_strip_and_lowercase_analyzer`

You can run the following request to test the analyzer:
Use the following request to examine the tokens generated using the analyzer:

```json
GET /html_strip_and_lowercase_analyzer/_analyze
@@ -72,8 +92,32 @@ GET /html_strip_and_lowercase_analyzer/_analyze

In the response, the HTML tags have been removed and the plain text has been converted to lowercase:

```
welcome to opensearch!
```json
{
"tokens": [
{
"token": "welcome",
"start_offset": 4,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "to",
"start_offset": 12,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "opensearch",
"start_offset": 23,
"end_offset": 42,
"type": "<ALPHANUM>",
"position": 2
}
]
}
```

## Example: Custom analyzer that preserves HTML tags
@@ -104,9 +148,7 @@ PUT /html_strip_preserve_analyzer
```
{% include copy-curl.html %}

### Testing `html_strip_preserve_analyzer`

You can run the following request to test the analyzer:
Use the following request to examine the tokens generated using the analyzer:

```json
GET /html_strip_preserve_analyzer/_analyze
@@ -119,6 +161,18 @@ GET /html_strip_preserve_analyzer/_analyze

In the response, the `italic` and `bold` tags have been retained, as specified in the custom analyzer request:

```
```json
{
"tokens": [
{
"token": """
This is a <b>bold</b> and <i>italic</i> text.
""",
"start_offset": 0,
"end_offset": 52,
"type": "word",
"position": 0
}
]
}
```
6 changes: 3 additions & 3 deletions _analyzers/character-filters/index.md
@@ -14,6 +14,6 @@ Unlike token filters, which operate on tokens (words or terms), character filter

Use cases for character filters include:

- **HTML stripping:** Removes HTML tags from content so that only the plain text is indexed.
- **Pattern replacement:** Replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
- **Custom mappings:** Substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.
- **HTML stripping**: The [`html_strip`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/html-character-filter/) character filter removes HTML tags from content so that only the plain text is indexed.
- **Pattern replacement**: The [`pattern_replace`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/pattern-replace-character-filter/) character filter replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
- **Custom mappings**: The [`mapping`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/mapping-character-filter/) character filter substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.
125 changes: 125 additions & 0 deletions _analyzers/character-filters/mapping-character-filter.md
@@ -0,0 +1,125 @@
---
layout: default
title: Mapping
parent: Character filters
nav_order: 120
---

# Mapping character filter

The `mapping` character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters matching a key, it replaces them with the corresponding value. Replacement values can be empty strings.

The filter applies greedy matching: when multiple keys match at the same position, the longest matching key takes precedence.

The `mapping` character filter helps in scenarios where specific text replacements are required before tokenization.

## Example

The following request configures a `mapping` character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3):

```json
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [
"I => 1",
"II => 2",
"III => 3",
"IV => 4",
"V => 5"
]
}
],
"text": "I have III apples and IV oranges"
}
```
{% include copy-curl.html %}

The response contains a token where Roman numerals have been replaced with Arabic numerals:

```json
{
"tokens": [
{
"token": "1 have 3 apples and 4 oranges",
"start_offset": 0,
"end_offset": 32,
"type": "word",
"position": 0
}
]
}
```

## Parameters

You can use either of the following parameters to configure the key-value map.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `mappings` | Optional | Array | An array of key-value pairs in the format `key => value`. Each key found in the input text will be replaced with its corresponding value. |
| `mappings_path` | Optional | String | The path to a UTF-8 encoded file containing key-value mappings. Each mapping should appear on a new line in the format `key => value`. The path can be absolute or relative to the OpenSearch configuration directory. |
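
The `mappings_path` variant can be sketched as follows, assuming a UTF-8 file at `analysis/roman-numerals.txt` (an illustrative path) exists under the OpenSearch config directory and contains one `key => value` pair per line:

```json
PUT /roman-numeral-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "roman_numeral_filter": {
          "type": "mapping",
          "mappings_path": "analysis/roman-numerals.txt"
        }
      }
    }
  }
}
```

The index name and filter name are placeholders; only `type` and one of `mappings` or `mappings_path` are required by the filter.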

### Using a custom mapping character filter

You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in a text:

```json
PUT /test-index
{
"settings": {
"analysis": {
"analyzer": {
"custom_abbr_analyzer": {
"tokenizer": "standard",
"char_filter": [
"custom_abbr_filter"
]
}
},
"char_filter": {
"custom_abbr_filter": {
"type": "mapping",
"mappings": [
"BTW => By the way",
"IDK => I don't know",
"FYI => For your information"
]
}
}
}
}
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the custom character filter:

```json
GET /test-index/_analyze
{
"tokenizer": "keyword",
"char_filter": [ "custom_abbr_filter" ],
"text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}
```
{% include copy-curl.html %}

The response shows that the abbreviations were replaced:

```json
{
"tokens": [
{
"token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
"start_offset": 0,
"end_offset": 153,
"type": "word",
"position": 0
}
]
}
```
