Commit

Merge branch 'main' into linear-regression
kolchfa-aws authored Jan 9, 2025
2 parents f45547a + e7e36b5 commit e8682ee
Showing 21 changed files with 1,308 additions and 66 deletions.
80 changes: 67 additions & 13 deletions _analyzers/character-filters/html-character-filter.md
@@ -9,7 +9,9 @@ nav_order: 100

The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and renders plain text. The filter can be configured to preserve certain tags or decode specific HTML entities, such as `&nbsp;`, into spaces.

## Example: HTML analyzer
## Example

The following request applies an `html_strip` character filter to the provided text:

```json
GET /_analyze
@@ -23,15 +25,35 @@ GET /_analyze
```
{% include copy-curl.html %}

Using the HTML analyzer, you can convert the HTML character entity references into their corresponding symbols. The processed text would read as follows:
The response contains a token in which the HTML character entities have been converted to their decoded values:

```
```json
{
"tokens": [
{
"token": """
Commonly used calculus symbols include α, β and θ
""",
"start_offset": 0,
"end_offset": 74,
"type": "word",
"position": 0
}
]
}
```

## Parameters

The `html_strip` character filter can be configured with the following parameter.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `escaped_tags` | Optional | Array of strings | An array of HTML element names, specified without the enclosing angle brackets (`< >`). The filter does not remove elements in this list when stripping HTML from the text. For example, setting the array to `["b", "i"]` will prevent the `<b>` and `<i>` elements from being stripped.|
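
You can also try this parameter ad hoc with the `_analyze` API. The following minimal sketch (the sample text is illustrative) strips `<p>` tags while preserving `<b>`:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": ["b"]
    }
  ],
  "text": "<p>This is <b>bold</b> text.</p>"
}
```

In the resulting token, the `<p>` element is removed, but the `<b>` element remains in the text.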

## Example: Custom analyzer with lowercase filter

The following example query creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` analyzer and `lowercase` filter:
The following example request creates a custom analyzer that strips HTML tags and converts the plain text to lowercase by using the `html_strip` analyzer and `lowercase` filter:

```json
PUT /html_strip_and_lowercase_analyzer
@@ -57,9 +79,7 @@ PUT /html_strip_and_lowercase_analyzer
```
{% include copy-curl.html %}

### Testing `html_strip_and_lowercase_analyzer`

You can run the following request to test the analyzer:
Use the following request to examine the tokens generated using the analyzer:

```json
GET /html_strip_and_lowercase_analyzer/_analyze
@@ -72,8 +92,32 @@ GET /html_strip_and_lowercase_analyzer/_analyze

In the response, the HTML tags have been removed and the plain text has been converted to lowercase:

```
welcome to opensearch!
```json
{
"tokens": [
{
"token": "welcome",
"start_offset": 4,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "to",
"start_offset": 12,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "opensearch",
"start_offset": 23,
"end_offset": 42,
"type": "<ALPHANUM>",
"position": 2
}
]
}
```

## Example: Custom analyzer that preserves HTML tags
@@ -104,9 +148,7 @@ PUT /html_strip_preserve_analyzer
```
{% include copy-curl.html %}

### Testing `html_strip_preserve_analyzer`

You can run the following request to test the analyzer:
Use the following request to examine the tokens generated using the analyzer:

```json
GET /html_strip_preserve_analyzer/_analyze
@@ -119,6 +161,18 @@ GET /html_strip_preserve_analyzer/_analyze

In the response, the `italic` and `bold` tags have been retained, as specified in the custom analyzer request:

```
```json
{
"tokens": [
{
"token": """
This is a <b>bold</b> and <i>italic</i> text.
""",
"start_offset": 0,
"end_offset": 52,
"type": "word",
"position": 0
}
]
}
```
6 changes: 3 additions & 3 deletions _analyzers/character-filters/index.md
@@ -14,6 +14,6 @@ Unlike token filters, which operate on tokens (words or terms), character filter

Use cases for character filters include:

- **HTML stripping:** Removes HTML tags from content so that only the plain text is indexed.
- **Pattern replacement:** Replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
- **Custom mappings:** Substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.
- **HTML stripping**: The [`html_strip`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/html-character-filter/) character filter removes HTML tags from content so that only the plain text is indexed.
- **Pattern replacement**: The [`pattern_replace`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/pattern-replace-character-filter/) character filter replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
- **Custom mappings**: The [`mapping`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/mapping-character-filter/) character filter substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.
125 changes: 125 additions & 0 deletions _analyzers/character-filters/mapping-character-filter.md
@@ -0,0 +1,125 @@
---
layout: default
title: Mapping
parent: Character filters
nav_order: 120
---

# Mapping character filter

The `mapping` character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string of characters matching a key, it replaces them with the corresponding value. Replacement values can be empty strings.

The filter applies greedy matching: when multiple keys match at the same position, the longest matching key takes precedence.

The `mapping` character filter helps in scenarios where specific text replacements are required before tokenization.

## Example

The following request configures a `mapping` character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3):

```json
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [
"I => 1",
"II => 2",
"III => 3",
"IV => 4",
"V => 5"
]
}
],
"text": "I have III apples and IV oranges"
}
```
{% include copy-curl.html %}

The response contains a token where Roman numerals have been replaced with Arabic numerals:

```json
{
"tokens": [
{
"token": "1 have 3 apples and 4 oranges",
"start_offset": 0,
"end_offset": 32,
"type": "word",
"position": 0
}
]
}
```

## Parameters

You can use either of the following parameters to configure the key-value map.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `mappings` | Optional | Array | An array of key-value pairs in the format `key => value`. Each key found in the input text will be replaced with its corresponding value. |
| `mappings_path` | Optional | String | The path to a UTF-8 encoded file containing key-value mappings. Each mapping should appear on a new line in the format `key => value`. The path can be absolute or relative to the OpenSearch configuration directory. |
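
The `mappings_path` variant can be sketched as follows, assuming a UTF-8 file at `analysis/roman-numerals.txt` (an illustrative path) exists under the OpenSearch config directory and contains one `key => value` pair per line:

```json
PUT /roman-numeral-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "roman_numeral_filter": {
          "type": "mapping",
          "mappings_path": "analysis/roman-numerals.txt"
        }
      }
    }
  }
}
```

The index name and filter name are placeholders; only `type` and one of `mappings` or `mappings_path` are required by the filter.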

### Using a custom mapping character filter

You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in a text:

```json
PUT /test-index
{
"settings": {
"analysis": {
"analyzer": {
"custom_abbr_analyzer": {
"tokenizer": "standard",
"char_filter": [
"custom_abbr_filter"
]
}
},
"char_filter": {
"custom_abbr_filter": {
"type": "mapping",
"mappings": [
"BTW => By the way",
"IDK => I don't know",
"FYI => For your information"
]
}
}
}
}
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the custom character filter:

```json
GET /test-index/_analyze
{
"tokenizer": "keyword",
"char_filter": [ "custom_abbr_filter" ],
"text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}
```
{% include copy-curl.html %}

The response shows that the abbreviations were replaced:

```json
{
"tokens": [
{
"token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
"start_offset": 0,
"end_offset": 153,
"type": "word",
"position": 0
}
]
}
```
