Merge branch 'main' into text_chunking_nested_example

opensearch-project · Mar 29, 2024 · 3002729
2 parents 1b6ae0d + 5d9edcb
commit 3002729
Showing 77 changed files with 2,638 additions and 312 deletions.
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -1 +1 @@
-*  @hdhalter @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @scrawfor99
+*  @hdhalter @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @scrawfor99 @epugh
diff --git a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt
@@ -81,6 +81,7 @@ Levenshtein
 [Oo]nboarding
 pebibyte
 [Pp]erformant
+[Pp]laintext
 [Pp]luggable
 [Pp]reconfigure
 [Pp]refetch
@@ -92,6 +93,7 @@ pebibyte
 [Pp]reprocess
 [Pp]retrain
 [Pp]seudocode
+[Quantiz](e|ation|ing|er)
 [Rr]ebalance
 [Rr]ebalancing
 [Rr]edownload

diff --git a/MAINTAINERS.md b/MAINTAINERS.md
@@ -1,6 +1,6 @@
 ## Overview
 
-This document contains a list of maintainers in this repo. See [opensearch-project/.github/RESPONSIBILITIES.md](https://github.com/opensearch-project/.github/blob/main/RESPONSIBILITIES.md#maintainer-responsibilities) that explains what the role of maintainer means, what maintainers do in this and other repos, and how they should be doing it. If you're interested in contributing, and becoming a maintainer, see [CONTRIBUTING](CONTRIBUTING.md).
+This document lists the maintainers in this repo. See [opensearch-project/.github/RESPONSIBILITIES.md](https://github.com/opensearch-project/.github/blob/main/RESPONSIBILITIES.md#maintainer-responsibilities) for information about the role of a maintainer, what maintainers do in this and other repos, and how they should be doing it. If you're interested in contributing or becoming a maintainer, see [CONTRIBUTING](CONTRIBUTING.md).  
 
 ## Current Maintainers
 
@@ -9,8 +9,9 @@ This document contains a list of maintainers in this repo. See [opensearch-proje
 | Heather Halter   | [hdhalter](https://github.com/hdhalter)         | Amazon      |
 | Fanit Kolchina   | [kolchfa-aws](https://github.com/kolchfa-aws)   | Amazon      |
 | Nate Archer      | [Naarcha-AWS](https://github.com/Naarcha-AWS)   | Amazon      |
-| Nate Bower       | [natebower](https://github.com/natebower)       | Amazon      |
+| Nathan Bower     | [natebower](https://github.com/natebower)       | Amazon      |
 | Melissa Vagi     | [vagimeli](https://github.com/vagimeli)         | Amazon      |
 | Miki Barahmand   | [AMoo-Miki](https://github.com/AMoo-Miki)       | Amazon      |
 | David Venable    | [dlvenable](https://github.com/dlvenable)       | Amazon      | 
 | Stephen Crawford | [scraw99](https://github.com/scrawfor99)        | Amazon      |
+| Eric Pugh        | [epugh](https://github.com/epugh)               | OpenSource Connections  | 
diff --git a/_api-reference/document-apis/reindex.md b/_api-reference/document-apis/reindex.md
@@ -73,10 +73,11 @@ slice | Whether to manually or automatically slice the reindex operation so it e
 _source | Whether to reindex source fields. Specify a list of fields to reindex or true to reindex all fields. Default is true.
 id | The ID to associate with manual slicing.
 max | Maximum number of slices.
-dest | Information about the destination index. Valid values are `index`, `version_type`, and `op_type`.
+dest | Information about the destination index. Valid values are `index`, `version_type`, `op_type`, and `pipeline`.
 index | Name of the destination index.
 version_type | The indexing operation's version type. Valid values are `internal`, `external`, `external_gt` (retrieve the document if the specified version number is greater than the document’s current version), and `external_gte` (retrieve the document if the specified version number is greater than or equal to the document’s current version).
 op_type | Whether to copy over documents that are missing in the destination index. Valid values are `create` (ignore documents with the same ID from the source index) and `index` (copy everything from the source index).
+pipeline | The ingest pipeline to use during the reindex operation.
 script | A script that OpenSearch uses to apply transformations to the data during the reindex operation.
 source | The actual script that OpenSearch runs.
 lang | The scripting language. Valid options are `painless`, `expression`, `mustache`, and `java`.

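The new `pipeline` value in `dest` can be illustrated with a short Python sketch that builds a request body for `POST _reindex`. The index names, the pipeline name `my-pipeline`, and the helper `build_reindex_body` are hypothetical and exist only for this example; they are not part of any OpenSearch client.

```python
import json


def build_reindex_body(source_index, dest_index, pipeline=None):
    """Build a _reindex request body; `pipeline` is optional (hypothetical helper)."""
    dest = {"index": dest_index}
    if pipeline is not None:
        # Route reindexed documents through the named ingest pipeline.
        dest["pipeline"] = pipeline
    return {"source": {"index": source_index}, "dest": dest}


# Hypothetical names, chosen for illustration only.
body = build_reindex_body("source-index", "destination-index", "my-pipeline")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a `POST _reindex` request. Omitting the `pipeline` argument produces a body without the `pipeline` key.
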
diff --git a/_api-reference/nodes-apis/nodes-stats.md b/_api-reference/nodes-apis/nodes-stats.md
@@ -731,7 +731,10 @@ Select the arrow to view the example response.
         "nxLWtMdXQmWA-ZBVWU8nwA": {
           "timestamp": 1698401391000,
           "cpu_utilization_percent": "0.1",
-          "memory_utilization_percent": "3.9"
+          "memory_utilization_percent": "3.9",
+          "io_usage_stats": {
+            "max_io_utilization_percent": "99.6"
+          }
         }
       },
       "admission_control": {
@@ -742,6 +745,14 @@ Select the arrow to view the example response.
               "indexing": 1
             }
           }
+        },
+        "global_io_usage": {
+          "transport": {
+            "rejection_count": {
+              "search": 3,
+              "indexing": 1
+            }
+          }
         }
       }
     }
@@ -1252,16 +1263,20 @@ The `resource_usage_stats` object contains the resource usage statistics. Each e
 Field | Field type | Description
 :--- |:-----------| :---
 timestamp | Integer    | The last refresh time for the resource usage statistics, in milliseconds since the epoch.
-cpu_utilization_percent | Float      | Statistics for the average CPU usage of OpenSearch process within the time period configured in the `node.resource.tracker.global_cpu_usage.window_duration` setting.
+cpu_utilization_percent | Float      | Statistics for the average CPU usage of any OpenSearch processes within the time period configured in the `node.resource.tracker.global_cpu_usage.window_duration` setting.
 memory_utilization_percent | Float      | The node JVM memory usage statistics within the time period configured in the `node.resource.tracker.global_jvmmp.window_duration` setting.
+max_io_utilization_percent  | Float     |  (Linux only) Statistics for the average IO usage of any OpenSearch processes within the time period configured in the `node.resource.tracker.global_io_usage.window_duration` setting.
 
 ### `admission_control`
 
 The `admission_control` object contains the rejection count of search and indexing requests based on resource consumption and has the following properties.
+
 Field | Field type | Description
 :--- | :--- | :---
-admission_control.global_cpu_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node CPU usage limit was breached. In this case, additional search requests are rejected until the system recovers.
-admission_control.global_cpu_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node CPU usage limit was breached. In this case, additional indexing requests are rejected until the system recovers.
+admission_control.global_cpu_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node CPU usage limit was met. In this case, additional search requests are rejected until the system recovers. The CPU usage limit is configured in the `admission_control.search.cpu_usage.limit` setting.
+admission_control.global_cpu_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node CPU usage limit was met. Any additional indexing requests are rejected until the system recovers. The CPU usage limit is configured in the `admission_control.indexing.cpu_usage.limit` setting.
+admission_control.global_io_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node IO usage limit was met. Any additional search requests are rejected until the system recovers. The IO usage limit is configured in the `admission_control.search.io_usage.limit` setting (Linux only).
+admission_control.global_io_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node IO usage limit was met. Any additional indexing requests are rejected until the system recovers. The IO usage limit is configured in the `admission_control.indexing.io_usage.limit` setting (Linux only).
 
 ## Required permissions
 

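To show how the `admission_control` fields documented above might be consumed, here is a minimal Python sketch that sums the rejection counts across the `global_cpu_usage` and `global_io_usage` controllers. The helper `total_rejections` and the sample values are assumptions made for illustration; only the field layout is taken from the table above.

```python
def total_rejections(admission_control):
    """Sum transport-layer rejection counts per action across all controllers."""
    totals = {}
    for controller in admission_control.values():
        counts = controller.get("transport", {}).get("rejection_count", {})
        for action, count in counts.items():
            totals[action] = totals.get(action, 0) + count
    return totals


# Sample values are illustrative, not real cluster output.
sample = {
    "global_cpu_usage": {"transport": {"rejection_count": {"search": 3, "indexing": 1}}},
    "global_io_usage": {"transport": {"rejection_count": {"search": 3, "indexing": 1}}},
}
print(total_rejections(sample))
```

In practice, the input would be the `admission_control` object of a single node taken from the nodes stats response.
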
diff --git a/_api-reference/snapshots/get-snapshot-status.md b/_api-reference/snapshots/get-snapshot-status.md
@@ -29,9 +29,9 @@ Three request variants provide flexibility:
 
 * `GET _snapshot/_status` returns the status of all currently running snapshots in all repositories.
 
-* `GET _snapshot/<repository>/_status` returns the status of only currently running snapshots in the specified repository. This is the preferred variant.
+* `GET _snapshot/<repository>/_status` returns all currently running snapshots in the specified repository. This is the preferred variant.
 
-* `GET _snapshot/<repository>/<snapshot>/_status` returns the status of all snapshots in the specified repository whether they are running or not.
+* `GET _snapshot/<repository>/<snapshot>/_status` returns detailed status information for a specific snapshot in the specified repository, regardless of whether it's currently running or not. 
 
 Using the API to return state for other than currently running snapshots can be very costly in terms of (1) machine resources and (2) processing time if running in the cloud. For each snapshot, each request causes file reads from all of a snapshot's shards. 
 {: .warning}
@@ -420,4 +420,4 @@ All property values are Integers.
 :--- | :--- | :--- |
 | shards_stats | Object | See [Shard stats](#shard-stats). |
 | stats | Object | See [Snapshot file stats](#snapshot-file-stats). |
-| shards | list of Objects | List of objects containing information about the shards that include the snapshot. Properies of the shards are listed below in bold text. <br /><br /> **stage**: Current state of shards in the snapshot. Shard states are: <br /><br /> * DONE: Number of shards in the snapshot that were successfully stored in the repository. <br /><br /> * FAILURE: Number of shards in the snapshot that were not successfully stored in the repository. <br /><br /> * FINALIZE: Number of shards in the snapshot that are in the finalizing stage of being stored in the repository. <br /><br />* INIT: Number of shards in the snapshot that are in the initializing stage of being stored in the repository.<br /><br />* STARTED:  Number of shards in the snapshot that are in the started stage of being stored in the repository.<br /><br /> **stats**: See [Snapshot file stats](#snapshot-file-stats). <br /><br /> **total**: Total number and size of files referenced by the snapshot. <br /><br /> **start_time_in_millis**: Time (in milliseconds) when snapshot creation began. <br /><br /> **time_in_millis**: Total time (in milliseconds) that the snapshot took to complete.  |
+| shards | list of Objects | List of objects containing information about the shards that include the snapshot. OpenSearch returns the following properties about the shards. <br /><br /> **stage**: Current state of shards in the snapshot. Shard states are: <br /><br /> * DONE: Number of shards in the snapshot that were successfully stored in the repository. <br /><br /> * FAILURE: Number of shards in the snapshot that were not successfully stored in the repository. <br /><br /> * FINALIZE: Number of shards in the snapshot that are in the finalizing stage of being stored in the repository. <br /><br />* INIT: Number of shards in the snapshot that are in the initializing stage of being stored in the repository.<br /><br />* STARTED:  Number of shards in the snapshot that are in the started stage of being stored in the repository.<br /><br /> **stats**: See [Snapshot file stats](#snapshot-file-stats). <br /><br /> **total**: Total number and size of files referenced by the snapshot. <br /><br /> **start_time_in_millis**: Time (in milliseconds) when snapshot creation began. <br /><br /> **time_in_millis**: Total time (in milliseconds) that the snapshot took to complete.  |
diff --git a/_automating-configurations/index.md b/_automating-configurations/index.md
@@ -8,7 +8,7 @@ redirect_from: /automating-configurations/
 ---
 
 # Automating configurations
-**Introduced 2.12**
+**Introduced 2.13**
 {: .label .label-purple }
 
 You can automate complex OpenSearch setup and preprocessing tasks by providing templates for common use cases. For example, automating machine learning (ML) setup tasks streamlines the use of OpenSearch ML offerings.

diff --git a/_dashboards/dashboards-assistant/index.md b/_dashboards/dashboards-assistant/index.md
@@ -6,14 +6,11 @@ has_children: false
 has_toc: false
 ---
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the feature's progress or to leave feedback, go to the [`dashboards-assistant` repository](https://github.com/opensearch-project/dashboards-assistant) on GitHub or the associated [OpenSearch forum thread](https://forum.opensearch.org/t/feedback-opensearch-assistant/16741).
-{: .warning}
-
 Note that machine learning models are probabilistic and that some may perform better than others, so the OpenSearch Assistant may occasionally produce inaccurate information. We recommend evaluating outputs for accuracy as appropriate to your use case, including reviewing the output or combining it with other verification factors.
 {: .important}
 
 # OpenSearch Assistant for OpenSearch Dashboards
-Introduced 2.12
+**Introduced 2.13**
 {: .label .label-purple }
 
 The OpenSearch Assistant toolkit helps you create AI-powered assistants for OpenSearch Dashboards without requiring you to have specialized query tools or skills.
@@ -49,9 +46,6 @@ A screenshot of the interface is shown in the following image.
 
 <img width="700" src="{{site.url}}{{site.baseurl}}/images/dashboards/opensearch-assistant-full-frame.png" alt="OpenSearch Assistant interface">
 
-For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/).
-{: .note}
-
 ## Configuring OpenSearch Assistant
 
 You can use the OpenSearch Dashboards interface to configure OpenSearch Assistant. Go to the [Getting started guide](https://github.com/opensearch-project/dashboards-assistant/blob/main/GETTING_STARTED_GUIDE.md) for step-by-step instructions. For the chatbot template, go to the [Flow Framework plugin](https://github.com/opensearch-project/flow-framework) documentation. You can modify this template to use your own model and customize the chatbot tools. 

diff --git a/_data-prepper/managing-data-prepper/configuring-data-prepper.md b/_data-prepper/managing-data-prepper/configuring-data-prepper.md
@@ -128,6 +128,7 @@ extensions:
         region: <YOUR_REGION_1>
         sts_role_arn: <YOUR_STS_ROLE_ARN_1>
         refresh_interval: <YOUR_REFRESH_INTERVAL>
+        disable_refresh: false
       <YOUR_SECRET_CONFIG_ID_2>:
         ...
 ```
@@ -148,7 +149,8 @@ Option | Required | Type | Description
 secret_id  | Yes | String | The AWS secret name or ARN.                                                                                                                                                                                              |
 region | No | String   | The AWS region of the secret. Defaults to `us-east-1`.                                                                                                                                                                            
 sts_role_arn | No | String   | The AWS Security Token Service (AWS STS) role to assume for requests to the AWS Secrets Manager. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). 
-refresh_interval | No | Duration | The refreshment interval for AWS secrets extension plugin to poll new secret values. Defaults to `PT1H`. See [Automatically refreshing secrets](#automatically-refreshing-secrets) for details.                             
+refresh_interval | No | Duration | The interval at which the AWS Secrets extension plugin polls for new secret values. Defaults to `PT1H`. For more information, see [Automatically refreshing secrets](#automatically-refreshing-secrets).
+disable_refresh | No | Boolean | Disables regular polling for the latest secret values in the AWS Secrets extension plugin. Defaults to `false`. When set to `true`, `refresh_interval` is not used.
 
 #### Reference secrets

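The interaction between `disable_refresh` and `refresh_interval` described in the option table can be sketched in Python as follows. This models only the documented behavior and is not Data Prepper's implementation; the helper `effective_refresh_interval` and the sample configurations are hypothetical.

```python
DEFAULT_REFRESH_INTERVAL = "PT1H"  # documented default for refresh_interval


def effective_refresh_interval(secret_config):
    """Return the polling interval, or None when polling is disabled."""
    if secret_config.get("disable_refresh", False):
        # Per the option table, refresh_interval is not used in this case.
        return None
    return secret_config.get("refresh_interval", DEFAULT_REFRESH_INTERVAL)


# Hypothetical per-secret configurations.
print(effective_refresh_interval({"secret_id": "my-secret"}))
print(effective_refresh_interval({"secret_id": "my-secret", "refresh_interval": "PT15M"}))
print(effective_refresh_interval({"secret_id": "my-secret", "refresh_interval": "PT15M", "disable_refresh": True}))
```
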
diff --git a/_data-prepper/managing-data-prepper/extensions/extensions.md b/_data-prepper/managing-data-prepper/extensions/extensions.md
@@ -0,0 +1,15 @@
+---
+layout: default
+title: Extensions
+parent: Managing Data Prepper
+has_children: true
+nav_order: 18
+---
+
+# Extensions
+
+Data Prepper extensions provide Data Prepper functionality outside of core Data Prepper pipeline components.
+Many extensions provide configuration options that give Data Prepper administrators greater flexibility over Data Prepper's functionality.
+
+You can configure extensions in the `data-prepper-config.yaml` file under the `extensions:` YAML block.
+
diff --git a/_data-prepper/managing-data-prepper/extensions/geoip_service.md b/_data-prepper/managing-data-prepper/extensions/geoip_service.md
@@ -0,0 +1,67 @@
+---
+layout: default
+title: geoip_service
+nav_order: 5
+parent: Extensions
+grand_parent: Managing Data Prepper
+---
+
+# geoip_service
+
+The `geoip_service` extension configures all [`geoip`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/geoip) processors in Data Prepper.
+
+## Usage
+
+You can configure the GeoIP service that Data Prepper uses for the `geoip` processor.
+By default, the GeoIP service comes with the [`maxmind`](#maxmind) option configured.
+
+The following example shows how to configure the `geoip_service` in the `data-prepper-config.yaml` file:
+
+```
+extensions:
+  geoip_service:
+    maxmind:
+      database_refresh_interval: PT1H
+      cache_count: 16_384
+```
+
+## maxmind
+
+The GeoIP service supports the MaxMind [GeoIP and GeoLite](https://dev.maxmind.com/geoip) databases.
+By default, Data Prepper will use all three of the following [MaxMind GeoLite2](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data) databases:
+
+* City
+* Country
+* ASN
+
+The service also downloads databases automatically to keep Data Prepper up to date with changes from MaxMind.
+
+You can use the following options to configure the `maxmind` extension.
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`databases` | No | [database](#database) | The database configuration.
+`database_refresh_interval` | No | Duration | How frequently to check for updates from MaxMind. This can be any duration in the range of 15 minutes to 30 days. Default is `PT7D`.
+`cache_count` | No | Integer | The maximum cache count by number of items in the cache, with a range of 100--100,000. Default is `4096`.
+`database_destination` | No | String | The name of the directory in which to store downloaded databases. Default is `{data-prepper.dir}/data/geoip`.
+`aws` | No | [aws](#aws) | Configures the AWS credentials for downloading the database from Amazon Simple Storage Service (Amazon S3).
+`insecure` | No | Boolean | When `true`, this option allows you to download database files over HTTP. Default is `false`.
+
+## database
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`city` | No | String | The URL at which the city database resides. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL.
+`country` | No | String | The URL at which the country database resides. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL.
+`asn` | No | String | The URL at which the Autonomous System Number (ASN) database resides. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL.
+`enterprise` | No | String | The URL at which the enterprise database resides. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL.
+
+
+## aws
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`region` | No | String | The AWS Region to use for the credentials. Default is the [standard SDK behavior for determining the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
+`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon S3. Default is `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
+`aws_sts_header_overrides` | No | Map | A map of header overrides that the AWS Identity and Access Management (IAM) role assumes when downloading from Amazon S3.
+`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the STS role. For more information, see the `ExternalID` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference.
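
The documented ranges for `database_refresh_interval` and `cache_count` can be checked before deployment with a small Python sketch. It assumes both ranges are inclusive, which the option table does not state explicitly, and it takes the interval as a `timedelta` rather than parsing an ISO 8601 duration string. The helper `check_maxmind_options` is hypothetical and not part of Data Prepper.

```python
from datetime import timedelta

# Limits quoted from the maxmind option table; inclusive bounds are an assumption.
MIN_REFRESH = timedelta(minutes=15)
MAX_REFRESH = timedelta(days=30)
MIN_CACHE, MAX_CACHE = 100, 100_000


def check_maxmind_options(refresh=timedelta(days=7), cache_count=4096):
    """Return a list of problems; an empty list means both values are in range."""
    problems = []
    if not MIN_REFRESH <= refresh <= MAX_REFRESH:
        problems.append("database_refresh_interval must be between 15 minutes and 30 days")
    if not MIN_CACHE <= cache_count <= MAX_CACHE:
        problems.append("cache_count must be between 100 and 100,000")
    return problems


# Values from the example configuration (PT1H and 16_384).
print(check_maxmind_options(timedelta(hours=1), 16_384))
# Deliberately out-of-range values.
print(check_maxmind_options(timedelta(minutes=5), 50))
```

The function defaults mirror the documented defaults of `PT7D` and `4096`.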