From 9570cecf1457abb3676df8caf87da7de7b63b1dc Mon Sep 17 00:00:00 2001 From: Riya <69919272+riysaxen-amzn@users.noreply.github.com> Date: Tue, 26 Mar 2024 10:19:54 -0700 Subject: [PATCH 01/18] Add new parameters to findings API in security analytics (#6499) * adding more parrams findings API Signed-off-by: Riya Saxena * Update _security-analytics/api-tools/alert-finding-api.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Riya <69919272+riysaxen-amzn@users.noreply.github.com> * Update _security-analytics/api-tools/alert-finding-api.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Riya <69919272+riysaxen-amzn@users.noreply.github.com> * Update _security-analytics/api-tools/alert-finding-api.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Riya <69919272+riysaxen-amzn@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _security-analytics/api-tools/alert-finding-api.md Co-authored-by: Nathan Bower Signed-off-by: Riya <69919272+riysaxen-amzn@users.noreply.github.com> --------- Signed-off-by: Riya Saxena Signed-off-by: Riya <69919272+riysaxen-amzn@users.noreply.github.com> Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../api-tools/alert-finding-api.md | 221 +++++++++++++++++- 1 file changed, 219 insertions(+), 2 deletions(-) diff --git a/_security-analytics/api-tools/alert-finding-api.md b/_security-analytics/api-tools/alert-finding-api.md index a22b601b08..f2631f2a50 100644 --- a/_security-analytics/api-tools/alert-finding-api.md +++ b/_security-analytics/api-tools/alert-finding-api.md @@ -149,13 +149,230 @@ You can specify the following parameters when getting findings. Parameter | Description :--- | :--- -`detector_id` | The ID of the detector used to fetch alerts. Optional when the `detectorType` is specified. Otherwise required. -`detectorType` | The type of detector used to fetch alerts. Optional when the `detector_id` is specified. Otherwise required. +`detector_id` | The ID of the detector used to fetch alerts. Optional. +`detectorType` | The type of detector used to fetch alerts. Optional. `sortOrder` | The order used to sort the list of findings. Possible values are `asc` or `desc`. Optional. `size` | An optional limit for the maximum number of results returned in the response. Optional. +`startIndex` | The pagination indicator. Optional. +`detectionType` | The detection rule type that dictates the retrieval type for the findings. When the detection type is `threat`, it fetches threat intelligence feeds. When the detection type is `rule`, findings are fetched based on the detector's rule. Optional. +`severity` | The severity of the detector rule used to fetch alerts. Severity can be `critical`, `high`, `medium`, or `low`. Optional. 
### Example request +```json +GET /_plugins/_security_analytics/findings/_search +{ + "total_findings": 2, + "findings": [ + { + "detectorId": "b9ZN040Bjlggkcgx1d1W", + "id": "35efb736-c5d9-499d-b9b5-31f0a7d61251", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "QdZN040Bjlggkcgxdd3X", + "name": "QdZN040Bjlggkcgxdd3X", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "high", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + }, + { + "detectorId": "O9ZM040Bjlggkcgx6N1S", + "id": "a5022930-4503-4ca8-bf0a-320a2b1fb433", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "KtZM040Bjlggkcgxkd04", + "name": "KtZM040Bjlggkcgxkd04", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "critical", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + } + ] +} + +``` + +```json +GET /_plugins/_security_analytics/findings/_search?severity=high +{ + "total_findings": 1, + "findings": [ + { + "detectorId": "b9ZN040Bjlggkcgx1d1W", + "id": "35efb736-c5d9-499d-b9b5-31f0a7d61251", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "QdZN040Bjlggkcgxdd3X", + "name": "QdZN040Bjlggkcgxdd3X", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "high", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + } + ] +} + +``` + +```json +GET /_plugins/_security_analytics/findings/_search?detectionType=rule +{ + "total_findings": 2, + "findings": [ + { + "detectorId": "b9ZN040Bjlggkcgx1d1W", + "id": "35efb736-c5d9-499d-b9b5-31f0a7d61251", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "QdZN040Bjlggkcgxdd3X", + "name": "QdZN040Bjlggkcgxdd3X", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "high", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + }, + { + "detectorId": "O9ZM040Bjlggkcgx6N1S", + "id": "a5022930-4503-4ca8-bf0a-320a2b1fb433", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "KtZM040Bjlggkcgxkd04", + "name": "KtZM040Bjlggkcgxkd04", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "critical", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + } + ] +} + + +``` +```json +GET /_plugins/_security_analytics/findings/_search?detectionType=rule&severity=high +{ + "total_findings": 1, + "findings": [ + { + "detectorId": "b9ZN040Bjlggkcgx1d1W", + "id": "35efb736-c5d9-499d-b9b5-31f0a7d61251", + "related_doc_ids": [ + "1" + ], + "index": "smallidx", + "queries": [ + { + "id": "QdZN040Bjlggkcgxdd3X", + "name": "QdZN040Bjlggkcgxdd3X", + "fields": [], + "query": "field1: *value1*", + "tags": [ + "high", + "ad_ldap" + ] + } + ], + "timestamp": 1708647166500, + "document_list": [ + { + "index": "smallidx", + "id": "1", + "found": true, + "document": "{\n \"field1\": \"value1\"\n}\n" + } + ] + } + ] +} + +``` + 
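The new query parameters can also be combined in a single request. The following request is an illustrative sketch rather than output captured from this patch: the values are placeholders, and it assumes that `startIndex` is a zero-based offset into the result set. It requests the second page of ten high-severity, rule-based findings, sorted in descending order:

```json
GET /_plugins/_security_analytics/findings/_search?detectionType=rule&severity=high&sortOrder=desc&size=10&startIndex=10
```

The response follows the same structure as the preceding examples, returning at most `size` findings per request.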
```json GET /_plugins/_security_analytics/findings/_search?*detectorType*= { From 005886f4bce4422989f9df861f6507bfa8a380ae Mon Sep 17 00:00:00 2001 From: Gagan Juneja Date: Tue, 26 Mar 2024 23:00:26 +0530 Subject: [PATCH 02/18] Add the supported metric types (#6754) * Update getting-started.md Signed-off-by: Gagan Juneja * incorporate review comments Signed-off-by: Gagan Juneja * Update getting-started.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Gagan Juneja Signed-off-by: Gagan Juneja Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Gagan Juneja Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../metrics/getting-started.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/_monitoring-your-cluster/metrics/getting-started.md b/_monitoring-your-cluster/metrics/getting-started.md index 21edceda7b..659614a07c 100644 --- a/_monitoring-your-cluster/metrics/getting-started.md +++ b/_monitoring-your-cluster/metrics/getting-started.md @@ -1,8 +1,9 @@ --- layout: default -title: Metrics framework -parent: Trace Analytics -nav_order: 65 +title: Metrics framework +nav_order: 1 +has_children: false +has_toc: false redirect_from: - /monitoring-your-cluster/metrics/ --- @@ -95,3 +96,12 @@ The metrics framework feature supports various telemetry solutions through plugi 2. **Exporters:** Exporters are responsible for persisting the data. OpenTelemetry provides several out-of-the-box exporters. OpenSearch supports the following exporters: - `LoggingMetricExporter`: Exports metrics to a log file, generating a separate file in the logs directory `_otel_metrics.log`. Default is `telemetry.otel.metrics.exporter.class=io.opentelemetry.exporter.logging.LoggingMetricExporter`. - `OtlpGrpcMetricExporter`: Exports spans through gRPC. To use this exporter, you need to install the `otel-collector` on the node. By default, it writes to the http://localhost:4317/ endpoint. To use this exporter, set the following static setting: `telemetry.otel.metrics.exporter.class=io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter`. + +### Supported metric types + +The metrics framework feature supports the following metric types: + +1. **Counters:** Counters are continuous and synchronous meters used to track the frequency of events over time. Counters can only be incremented with positive values, making them ideal for measuring the number of monitoring occurrences such as errors, processed or received bytes, and total requests. +2. **UpDown counters:** UpDown counters can be incremented with positive values or decremented with negative values. UpDown counters are well suited for tracking metrics like open connections, active requests, and other fluctuating quantities. +3. **Histograms:** Histograms are valuable tools for visualizing the distribution of continuous data. Histograms offer insight into the central tendency, spread, skewness, and potential outliers that might exist in your metrics. Patterns such as normal distribution, skewed distribution, or bimodal distribution can be readily identified, making histograms ideal for analyzing latency metrics and assessing percentiles. +4. **Asynchronous Gauges:** Asynchronous gauges capture the current value at the moment a metric is read. 
These metrics are non-additive and are commonly used to measure CPU utilization on a per-minute basis, memory utilization, and other real-time values. From 2992e42e87b9a56c08c1d8bcc458770b01e239e3 Mon Sep 17 00:00:00 2001 From: Movva Ajaykumar Date: Tue, 26 Mar 2024 23:31:27 +0530 Subject: [PATCH 03/18] Adding Documentation for IO Based AdmissionController Stats (#6755) * Adding Documentation for IO Based AdmissionController Stats Signed-off-by: Ajay Kumar Movva * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Ajay Kumar Movva Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Ajay Kumar Movva Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _api-reference/nodes-apis/nodes-stats.md | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/_api-reference/nodes-apis/nodes-stats.md b/_api-reference/nodes-apis/nodes-stats.md index 4fdb5c3cb8..87365fa900 100644 --- a/_api-reference/nodes-apis/nodes-stats.md +++ b/_api-reference/nodes-apis/nodes-stats.md @@ -731,7 +731,10 @@ Select the arrow to view the example response. "nxLWtMdXQmWA-ZBVWU8nwA": { "timestamp": 1698401391000, "cpu_utilization_percent": "0.1", - "memory_utilization_percent": "3.9" + "memory_utilization_percent": "3.9", + "io_usage_stats": { + "max_io_utilization_percent": "99.6" + } } }, "admission_control": { @@ -742,6 +745,14 @@ Select the arrow to view the example response. "indexing": 1 } } + }, + "global_io_usage": { + "transport": { + "rejection_count": { + "search": 3, + "indexing": 1 + } + } } } } @@ -1252,16 +1263,20 @@ The `resource_usage_stats` object contains the resource usage statistics. Each e Field | Field type | Description :--- |:-----------| :--- timestamp | Integer | The last refresh time for the resource usage statistics, in milliseconds since the epoch. -cpu_utilization_percent | Float | Statistics for the average CPU usage of OpenSearch process within the time period configured in the `node.resource.tracker.global_cpu_usage.window_duration` setting. +cpu_utilization_percent | Float | Statistics for the average CPU usage of any OpenSearch processes within the time period configured in the `node.resource.tracker.global_cpu_usage.window_duration` setting. memory_utilization_percent | Float | The node JVM memory usage statistics within the time period configured in the `node.resource.tracker.global_jvmmp.window_duration` setting. +max_io_utilization_percent | Float | (Linux only) Statistics for the average IO usage of any OpenSearch processes within the time period configured in the `node.resource.tracker.global_io_usage.window_duration` setting. ### `admission_control` The `admission_control` object contains the rejection count of search and indexing requests based on resource consumption and has the following properties. + Field | Field type | Description :--- | :--- | :--- -admission_control.global_cpu_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node CPU usage limit was breached. In this case, additional search requests are rejected until the system recovers. -admission_control.global_cpu_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node CPU usage limit was breached. In this case, additional indexing requests are rejected until the system recovers. 
+admission_control.global_cpu_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node CPU usage limit was met. In this case, additional search requests are rejected until the system recovers. The CPU usage limit is configured in the `admission_control.search.cpu_usage.limit` setting. +admission_control.global_cpu_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node CPU usage limit was met. Any additional indexing requests are rejected until the system recovers. The CPU usage limit is configured in the `admission_control.indexing.cpu_usage.limit` setting. +admission_control.global_io_usage.transport.rejection_count.search | Integer | The total number of search rejections in the transport layer when the node IO usage limit was met. Any additional search requests are rejected until the system recovers. The CPU usage limit is configured in the `admission_control.search.io_usage.limit` setting (Linux only). +admission_control.global_io_usage.transport.rejection_count.indexing | Integer | The total number of indexing rejections in the transport layer when the node IO usage limit was met. Any additional indexing requests are rejected until the system recovers. The IO usage limit is configured in the `admission_control.indexing.io_usage.limit` setting (Linux only). ## Required permissions From cf6e39e7a969088f408ccc61951c495b35374ac2 Mon Sep 17 00:00:00 2001 From: Heather Halter Date: Tue, 26 Mar 2024 11:52:21 -0700 Subject: [PATCH 04/18] Update get-snapshot-status.md (#6572) * Update get-snapshot-status.md Signed-off-by: Heather Halter * Update get-snapshot-status.md Signed-off-by: Heather Halter * Update get-snapshot-status.md Signed-off-by: Heather Halter * Update get-snapshot-status.md Signed-off-by: Heather Halter * Update _api-reference/snapshots/get-snapshot-status.md I'm not seeing what you changed, but I trust something was fixed. Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: Heather Halter --------- Signed-off-by: Heather Halter Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _api-reference/snapshots/get-snapshot-status.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_api-reference/snapshots/get-snapshot-status.md b/_api-reference/snapshots/get-snapshot-status.md index 02aa419042..6f8320d0b0 100644 --- a/_api-reference/snapshots/get-snapshot-status.md +++ b/_api-reference/snapshots/get-snapshot-status.md @@ -29,9 +29,9 @@ Three request variants provide flexibility: * `GET _snapshot/_status` returns the status of all currently running snapshots in all repositories. -* `GET _snapshot//_status` returns the status of only currently running snapshots in the specified repository. This is the preferred variant. +* `GET _snapshot//_status` returns all currently running snapshots in the specified repository. This is the preferred variant. -* `GET _snapshot///_status` returns the status of all snapshots in the specified repository whether they are running or not. +* `GET _snapshot///_status` returns detailed status information for a specific snapshot in the specified repository, regardless of whether it's currently running or not. Using the API to return state for other than currently running snapshots can be very costly for (1) machine machine resources and (2) processing time if running in the cloud. 
For each snapshot, each request causes file reads from all of the snapshot's shards. {: .warning} @@ -420,4 +420,4 @@ All property values are Integers. :--- | :--- | :--- | | shards_stats | Object | See [Shard stats](#shard-stats). | | stats | Object | See [Snapshot file stats](#snapshot-file-stats). | -| shards | list of Objects | List of objects containing information about the shards that include the snapshot. Properies of the shards are listed below in bold text.

**stage**: Current state of shards in the snapshot. Shard states are:

* DONE: Number of shards in the snapshot that were successfully stored in the repository.

* FAILURE: Number of shards in the snapshot that were not successfully stored in the repository.

* FINALIZE: Number of shards in the snapshot that are in the finalizing stage of being stored in the repository.

* INIT: Number of shards in the snapshot that are in the initializing stage of being stored in the repository.

* STARTED: Number of shards in the snapshot that are in the started stage of being stored in the repository.

**stats**: See [Snapshot file stats](#snapshot-file-stats).

**total**: Total number and size of files referenced by the snapshot.

**start_time_in_millis**: Time (in milliseconds) when snapshot creation began.

**time_in_millis**: Total time (in milliseconds) that the snapshot took to complete. | \ No newline at end of file +| shards | list of Objects | List of objects containing information about the shards that include the snapshot. OpenSearch returns the following properties about the shards.

**stage**: Current state of shards in the snapshot. Shard states are:

* DONE: Number of shards in the snapshot that were successfully stored in the repository.

* FAILURE: Number of shards in the snapshot that were not successfully stored in the repository.

* FINALIZE: Number of shards in the snapshot that are in the finalizing stage of being stored in the repository.

* INIT: Number of shards in the snapshot that are in the initializing stage of being stored in the repository.

* STARTED: Number of shards in the snapshot that are in the started stage of being stored in the repository.

**stats**: See [Snapshot file stats](#snapshot-file-stats).

**total**: Total number and size of files referenced by the snapshot.

**start_time_in_millis**: Time (in milliseconds) when snapshot creation began.

**time_in_millis**: Total time (in milliseconds) that the snapshot took to complete. | From bb8cb9cf798a09502e12a6aebc59faadf0f72405 Mon Sep 17 00:00:00 2001 From: leedonggyu Date: Wed, 27 Mar 2024 08:16:27 +0900 Subject: [PATCH 05/18] Fix key of system index configuration (#6788) Signed-off-by: leedonggyu --- _security/access-control/permissions.md | 2 +- _security/configuration/yaml.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/_security/access-control/permissions.md b/_security/access-control/permissions.md index 4f8df5e042..0b2d609c35 100644 --- a/_security/access-control/permissions.md +++ b/_security/access-control/permissions.md @@ -124,7 +124,7 @@ green open .kibana_3 XmTePICFRoSNf5O5uLgwRw 1 1 220 0 468.3kb 232.1kb ### Enabling system index permissions -Users that have the permission [`restapi:admin/roles`]({{site.url}}{{site.baseurl}}/security/access-control/api/#access-control-for-the-api) are able to map system index permissions to all users in the same way they would for a cluster or index permission in the `roles.yml` file. However, to preserve some control over this permission, the `plugins.security.system_indices.permissions.enabled` setting allows you to enable or disable the system index permissions feature. This setting is disabled by default. To enable the system index permissions feature, set `plugins.security.system_indices.permissions.enabled` to `true`. For more information about this setting, see [Enabling user access to system indexes]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#enabling-user-access-to-system-indexes). +Users that have the permission [`restapi:admin/roles`]({{site.url}}{{site.baseurl}}/security/access-control/api/#access-control-for-the-api) are able to map system index permissions to all users in the same way they would for a cluster or index permission in the `roles.yml` file. However, to preserve some control over this permission, the `plugins.security.system_indices.permission.enabled` setting allows you to enable or disable the system index permissions feature. This setting is disabled by default. To enable the system index permissions feature, set `plugins.security.system_indices.permissions.enabled` to `true`. For more information about this setting, see [Enabling user access to system indexes]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#enabling-user-access-to-system-indexes). Keep in mind that enabling this feature and mapping system index permissions to normal users gives those users access to indexes that may contain sensitive information and configurations essential to a cluster's health. We also recommend caution when mapping users to `restapi:admin/roles` because this permission gives a user not only the ability to assign the system index permission to another user but also the ability to self-assign access to any system index. {: .warning } diff --git a/_security/configuration/yaml.md b/_security/configuration/yaml.md index 258866a7f8..af60238b42 100644 --- a/_security/configuration/yaml.md +++ b/_security/configuration/yaml.md @@ -139,12 +139,12 @@ plugins.security.cache.ttl_minutes: 60 ### Enabling user access to system indexes -Mapping a system index permission to a user allows that user to modify the system index specified in the permission's name (the one exception is the Security plugin's [system index]({{site.url}}{{site.baseurl}}/security/configuration/system-indices/)). 
The `plugins.security.system_indices.permissions.enabled` setting provides a way for administrators to make this permission available for or hidden from role mapping. +Mapping a system index permission to a user allows that user to modify the system index specified in the permission's name (the one exception is the Security plugin's [system index]({{site.url}}{{site.baseurl}}/security/configuration/system-indices/)). The `plugins.security.system_indices.permission.enabled` setting provides a way for administrators to make this permission available for or hidden from role mapping. When set to `true`, the feature is enabled and users with permission to modify roles can create roles that include permissions that grant access to system indexes: ```yml -plugins.security.system_indices.permissions.enabled: true +plugins.security.system_indices.permission.enabled: true ``` When set to `false`, the permission is disabled and only admins with an admin certificate can make changes to system indexes. By default, the permission is set to `false` in a new cluster. From e697522b99a1b77b83362d328e0ae940326e445e Mon Sep 17 00:00:00 2001 From: Krishna Kondaka <41027584+kkondaka@users.noreply.github.com> Date: Wed, 27 Mar 2024 09:29:59 -0700 Subject: [PATCH 06/18] Added documentation for select entries, map-to-list, and trucate processors. (#6660) * Added documentation for select entries and trucate processors. Updated other documents Signed-off-by: Kondaka * Update select-entries.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update and rename truncate-processor.md to truncate.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Delete _data-prepper/pipelines/configuration/processors/map-to-list.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/pipelines/configuration/processors/mutate-event.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/pipelines/configuration/buffers/kafka.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/pipelines/configuration/buffers/kafka.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/pipelines/configuration/buffers/kafka.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/pipelines/configuration/processors/mutate-event.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/pipelines/configuration/sources/s3.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/pipelines/configuration/processors/truncate.md Co-authored-by: Nathan 
Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Kondaka Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../pipelines/configuration/buffers/kafka.md | 15 ++- .../configuration/processors/mutate-event.md | 6 +- .../processors/select-entries.md | 51 +++++++++ .../configuration/processors/truncate.md | 107 ++++++++++++++++++ .../pipelines/configuration/sinks/file.md | 1 + .../pipelines/configuration/sources/s3.md | 1 + 6 files changed, 176 insertions(+), 5 deletions(-) create mode 100644 _data-prepper/pipelines/configuration/processors/select-entries.md create mode 100644 _data-prepper/pipelines/configuration/processors/truncate.md diff --git a/_data-prepper/pipelines/configuration/buffers/kafka.md b/_data-prepper/pipelines/configuration/buffers/kafka.md index 675a0c9775..f641874a91 100644 --- a/_data-prepper/pipelines/configuration/buffers/kafka.md +++ b/_data-prepper/pipelines/configuration/buffers/kafka.md @@ -41,11 +41,12 @@ Use the following configuration options with the `kafka` buffer. Option | Required | Type | Description --- | --- | --- | --- -`bootstrap_servers` | Yes | String list | The host and port for the initial connection to the Kafka cluster. You can configure multiple Kafka brokers by using the IP address or the port number for each broker. When using [Amazon Managed Streaming for Apache Kafka (Amazon MSK)](https://aws.amazon.com/msk/) as your Kafka cluster, the bootstrap server information is obtained from Amazon MSK using the Amazon Resource Name (ARN) provided in the configuration. -`topics` | Yes | List | A list of [topics](#topic) to use. You must supply one topic per buffer. `authentication` | No | [Authentication](#authentication) | Sets the authentication options for both the pipeline and Kafka. For more information, see [Authentication](#authentication). -`encryption` | No | [Encryption](#encryption) | The encryption configuration for encryption in transit. For more information, see [Encryption](#encryption). `aws` | No | [AWS](#aws) | The AWS configuration. For more information, see [aws](#aws). +`bootstrap_servers` | Yes | String list | The host and port for the initial connection to the Kafka cluster. You can configure multiple Kafka brokers by using the IP address or the port number for each broker. When using [Amazon Managed Streaming for Apache Kafka (Amazon MSK)](https://aws.amazon.com/msk/) as your Kafka cluster, the bootstrap server information is obtained from Amazon MSK using the Amazon Resource Name (ARN) provided in the configuration. +`encryption` | No | [Encryption](#encryption) | The encryption configuration for encryption in transit. For more information, see [Encryption](#encryption). +`producer_properties` | No | [Producer Properties](#producer_properties) | A list of configurable Kafka producer properties. +`topics` | Yes | List | A list of [topics](#topic) for the buffer to use. You must supply one topic per buffer. ### topic @@ -73,6 +74,7 @@ Option | Required | Type | Description `retry_backoff` | No | Integer | The amount of time to wait before attempting to retry a failed request to a given topic partition. Default is `10s`. `max_poll_interval` | No | Integer | The maximum delay between invocations of a `poll()` when using group management through Kafka's `max.poll.interval.ms` option. Default is `300s`. 
`consumer_max_poll_records` | No | Integer | The maximum number of records returned in a single `poll()` call through Kafka's `max.poll.records` setting. Default is `500`. +`max_message_bytes` | No | Integer | The maximum size of the message, in bytes. Default is 1 MB. ### kms @@ -123,6 +125,13 @@ Option | Required | Type | Description `type` | No | String | The encryption type. Use `none` to disable encryption. Default is `ssl`. `insecure` | No | Boolean | A Boolean flag used to turn off SSL certificate verification. If set to `true`, certificate authority (CA) certificate verification is turned off and insecure HTTP requests are sent. Default is `false`. +#### producer_properties + +Use the following configuration options to configure a Kafka producer. +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`max_request_size` | No | Integer | The maximum size of the request that the producer sends to Kafka. Default is 1 MB. + #### aws diff --git a/_data-prepper/pipelines/configuration/processors/mutate-event.md b/_data-prepper/pipelines/configuration/processors/mutate-event.md index 032bc89fcd..676cb2e168 100644 --- a/_data-prepper/pipelines/configuration/processors/mutate-event.md +++ b/_data-prepper/pipelines/configuration/processors/mutate-event.md @@ -11,11 +11,13 @@ nav_order: 65 Mutate event processors allow you to modify events in Data Prepper. The following processors are available: * [add_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/add-entries/) allows you to add entries to an event. +* [convert_entry_type]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/convert_entry_type/) allows you to convert value types in an event. * [copy_values]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/copy-values/) allows you to copy values within an event. * [delete_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/delete-entries/) allows you to delete entries from an event. -* [rename_keys]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/rename-keys/) allows you to rename keys in an event. -* [convert_entry_type]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/convert_entry_type/) allows you to convert value types in an event. * [list_to_map]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/list-to-map) allows you to convert list of objects from an event where each object contains a `key` field into a map of target keys. +* [map_to_list]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/map-to-list/) allows you to convert a map of objects from an event, where each object contains a `key` field, into a list of target keys. +* [rename_keys]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/rename-keys/) allows you to rename keys in an event. +* [select_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/select-entries/) allows you to select entries from an event. 
diff --git a/_data-prepper/pipelines/configuration/processors/select-entries.md b/_data-prepper/pipelines/configuration/processors/select-entries.md new file mode 100644 index 0000000000..39b79a5bcc --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/select-entries.md @@ -0,0 +1,51 @@ +--- +layout: default +title: select_entries +parent: Processors +grand_parent: Pipelines +nav_order: 59 +--- + +# select_entries + +The `select_entries` processor selects entries from a Data Prepper event. Only the selected entries will remain in the event, and all other entries will be removed from the event. + +## Configuration + +You can configure the `select_entries` processor using the following options. + +| Option | Required | Description | +| :--- | :--- | :--- | +| `include_keys` | Yes | A list of keys to be selected from an event. | +| `select_when` | No | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. | + +### Usage + +The following example shows how to configure the `select_entries` processor in the `pipeline.yaml` file: + +```yaml +pipeline: + source: + ... + .... + processor: + - select_entries: + entries: + - include_keys: [ "key1", "key2" ] + add_when: '/some_key == "test"' + sink: +``` +{% include copy.html %} + + +For example, when your source contains the following event record: + +```json +{"message": "hello", "key1" : "value1", "key2" : "value2", "some_key" : "test"} +``` + +The `select_entries` processor includes only `key1` and `key2` in the processed output: + +```json +{"key1": "value1", "key2": "value2"} +``` diff --git a/_data-prepper/pipelines/configuration/processors/truncate.md b/_data-prepper/pipelines/configuration/processors/truncate.md new file mode 100644 index 0000000000..3714d80847 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/truncate.md @@ -0,0 +1,107 @@ +--- +layout: default +title: truncate +parent: Processors +grand_parent: Pipelines +nav_order: 121 +--- + +# truncate + +The `truncate` processor truncates a key's value at the beginning, the end, or on both sides of the value string, based on the processor's configuration. If the key's value is a list, then each member in the string list is truncated. Non-string members of the list are not truncated. When the `truncate_when` option is provided, input is truncated only when the condition specified is `true` for the event being processed. + +## Configuration + +You can configure the `truncate` processor using the following options. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`entries` | Yes | String list | A list of entries to add to an event. +`source_keys` | No | String list | The list of source keys that will be modified by the processor. The default value is an empty list, which indicates that all values will be truncated. +`truncate_when` | No | Conditional expression | A condition that, when met, determines when the truncate operation is performed. +`start_at` | No | Integer | Where in the string value to start truncation. Default is `0`, which specifies to start truncation at the beginning of each key's value. +`length` | No | Integer| The length of the string after truncation. When not specified, the processor will measure the length based on where the string ends. 
+ +Either the `start_at` or `length` options must be present in the configuration in order for the `truncate` processor to run. You can define both values in the configuration in order to further customize where truncation occurs in the string. + +## Usage + +The following examples show how to configure the `truncate` processor in the `pipeline.yaml` file: + +## Example: Minimum configuration + +The following example shows the minimum configuration for the `truncate` processor: + +```yaml +pipeline: + source: + file: + path: "/full/path/to/logs_json.log" + record_type: "event" + format: "json" + processor: + - truncate: + entries: + - source_keys: ["message1", "message2"] + length: 5 + - source_keys: ["info"] + length: 6 + start_at: 4 + - source_keys: ["log"] + start_at: 5 + sink: + - stdout: +``` + +For example, the following event contains several keys with string values: + +```json +{"message1": "hello,world", "message2": "test message", "info", "new information", "log": "test log message"} +``` + +The `truncate` processor produces the following output, where: + +- The `start_at` setting is `0` for the `message1` and `message 2` keys, indicating that truncation will begin at the start of the string, with the string itself truncated to a length of `5`. +- The `start_at` setting is `4` for the `info` key, indicating that truncation will begin at letter `i` of the string, with the string truncated to a length of `6`. +- The `start_at` setting is `5` for the `log` key, with no length specified, indicating that truncation will begin at letter `l` of the string. + +```json +{"message1":"hello", "message2":"test ", "info":"inform", "log": "log message"} +``` + + +## Example: Using `truncate_when` + +The following example configuration shows the `truncate` processor with the `truncate_when` option configured: + +```yaml +pipeline: + source: + file: + path: "/full/path/to/logs_json.log" + record_type: "event" + format: "json" + processor: + - truncate: + entries: + - source_keys: ["message"] + length: 5 + start_at: 8 + truncate_when: '/id == 1' + sink: + - stdout: +``` + +The following example contains two events: + +```json +{"message": "hello, world", "id": 1} +{"message": "hello, world,not-truncated", "id": 2} +``` + +When the `truncate` processor runs on the events, only the first event is truncated because the `id` key contains a value of `1`: + +```json +{"message": "world", "id": 1} +{"message": "hello, world,not-truncated", "id": 2} +``` diff --git a/_data-prepper/pipelines/configuration/sinks/file.md b/_data-prepper/pipelines/configuration/sinks/file.md index 74af5a1803..bd4fec1865 100644 --- a/_data-prepper/pipelines/configuration/sinks/file.md +++ b/_data-prepper/pipelines/configuration/sinks/file.md @@ -17,6 +17,7 @@ The following table describes options you can configure for the `file` sink. Option | Required | Type | Description :--- | :--- | :--- | :--- path | Yes | String | Path for the output file (e.g. `logs/my-transformed-log.log`). +append | No | Boolean | When `true`, the sink file is opened in append mode. ## Usage diff --git a/_data-prepper/pipelines/configuration/sources/s3.md b/_data-prepper/pipelines/configuration/sources/s3.md index 7dc31caade..ad5de6884d 100644 --- a/_data-prepper/pipelines/configuration/sources/s3.md +++ b/_data-prepper/pipelines/configuration/sources/s3.md @@ -99,6 +99,7 @@ buffer_timeout | No | Duration | The amount of time allowed for writing events `s3_select` | No | [s3_select](#s3_select) | The Amazon S3 Select configuration. 
`scan` | No | [scan](#scan) | The S3 scan configuration. `delete_s3_objects_on_read` | No | Boolean | When `true`, the S3 scan attempts to delete S3 objects after all events from the S3 object are successfully acknowledged by all sinks. `acknowledgments` should be enabled when deleting S3 objects. Default is `false`. +`workers` | No | Integer | The number of worker threads. Default is `1`, with a min of `1` and a max of `1000`. Each worker thread subscribes to Amazon SQS messages. When a worker receives an SQS message, that worker processes the message independently from the other workers. ## sqs From f99c17a727c685e7e30ad6c8d17cddd78d32225d Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Wed, 27 Mar 2024 11:53:14 -0500 Subject: [PATCH 07/18] Update mutate-event.md (#6796) Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- .../pipelines/configuration/processors/mutate-event.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/processors/mutate-event.md b/_data-prepper/pipelines/configuration/processors/mutate-event.md index 676cb2e168..8ce9f7d921 100644 --- a/_data-prepper/pipelines/configuration/processors/mutate-event.md +++ b/_data-prepper/pipelines/configuration/processors/mutate-event.md @@ -15,7 +15,7 @@ Mutate event processors allow you to modify events in Data Prepper. The followin * [copy_values]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/copy-values/) allows you to copy values within an event. * [delete_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/delete-entries/) allows you to delete entries from an event. * [list_to_map]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/list-to-map) allows you to convert list of objects from an event where each object contains a `key` field into a map of target keys. -* [map_to_list]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/map-to-list/) allows you to convert a map of objects from an event, where each object contains a `key` field, into a list of target keys. +* `map_to_list` allows you to convert a map of objects from an event, where each object contains a `key` field, into a list of target keys. * [rename_keys]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/rename-keys/) allows you to rename keys in an event. * [select_entries]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/select-entries/) allows you to select entries from an event. 
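The `map_to_list` processor listed above is described only briefly here, so a minimal pipeline sketch follows. The option names (`source`, `target`) and the sample record are illustrative assumptions rather than values taken from this patch:

```yaml
map-to-list-pipeline:
  source:
    http:
  processor:
    - map_to_list:
        # Assumed options: read the map stored under "my-map" and write a list of
        # {"key": ..., "value": ...} entries to a new "my-list" field.
        source: "my-map"
        target: "my-list"
  sink:
    - stdout:
```

Under these assumptions, an event such as `{"my-map": {"a": 1, "b": 2}}` would gain a `my-list` field resembling `[{"key": "a", "value": 1}, {"key": "b", "value": 2}]`.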
From 6f862fa69d674c86a1c19f0e8262da987d8bf2cd Mon Sep 17 00:00:00 2001 From: Peter Nied Date: Wed, 27 Mar 2024 12:04:14 -0500 Subject: [PATCH 08/18] Add documentation for security config upgrade feature (#6634) * Add documentation for security config upgrade feature Signed-off-by: Peter Nied Signed-off-by: Peter Nied * Fix vale annotations Signed-off-by: Peter Nied * Feedback round 1 Signed-off-by: Peter Nied * Resolve OpenSearch.SpacingPunctuation Signed-off-by: Peter Nied * Fix vale error Signed-off-by: Peter Nied * Clean up rendering of list of options for upgrade Signed-off-by: Peter Nied * Clean up formatting around example a little Signed-off-by: Peter Nied * PR Feedback 2 Signed-off-by: Peter Nied Signed-off-by: Peter Nied * Update api.md * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _security/access-control/api.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _security/access-control/api.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Peter Nied Signed-off-by: Peter Nied Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower --- _security/access-control/api.md | 85 +++++++++++++++++++++++++++++++++ 1 file changed, 85 insertions(+) diff --git a/_security/access-control/api.md b/_security/access-control/api.md index acbdb5e0be..8a464bdeb1 100644 --- a/_security/access-control/api.md +++ b/_security/access-control/api.md @@ -1297,6 +1297,91 @@ PATCH _plugins/_security/api/securityconfig } ``` +### Configuration upgrade check + +Introduced 2.13 +{: .label .label-purple } + +Checks the current configuration bundled with the host's Security plugin and compares it to the version of the OpenSearch Security plugin the user downloaded. Then, the API responds indicating whether or not an upgrade can be performed and what resources can be updated. + +With each new OpenSearch version, there are changes to the default security configuration. This endpoint helps cluster operators determine whether the cluster is missing defaults or has stale definitions of defaults. +{: .note} + +#### Request + +```json +GET _plugins/_security/api/_upgrade_check +``` +{% include copy-curl.html %} + +#### Example response + +```json +{ + "status" : "OK", + "upgradeAvailable" : true, + "upgradeActions" : { + "roles" : { + "add" : [ "flow_framework_full_access" ] + } + } +} +``` + +#### Response fields + +| Field | Data type | Description | +|:---------|:-----------|:------------------------------| +| `upgradeAvailable` | Boolean | Responds with `true` when an upgrade to the security configuration is available. | +| `upgradeActions` | Object list | A list of security objects that would be modified when upgrading the host's Security plugin. | + +### Configuration upgrade + +Introduced 2.13 +{: .label .label-purple } + +Adds and updates resources on a host's existing security configuration from the configuration bundled with the latest version of the Security plugin. + +These bundled configuration files can be found in the `/security/config` directory. 
Default configuration files are updated when OpenSearch is upgraded, whereas the cluster configuration is only updated by the cluster operators. This endpoint helps cluster operator upgrade missing defaults and stale default definitions. + + +#### Request + +```json +POST _plugins/_security/api/_upgrade_perform +{ + "configs" : [ "roles" ] +} +``` +{% include copy-curl.html %} + +#### Request fields + +| Field | Data type | Description | Required | +|:----------------|:-----------|:------------------------------------------------------------------------------------------------------------------|:---------| +| `configs` | Array | Specifies the configurations to be upgraded. This field can include any combination of the following configurations: `actiongroups`,`allowlist`, `audit`, `internalusers`, `nodesdn`, `roles`, `rolesmappings`, `tenants`.
Default is all supported configurations. | No | + + +#### Example response + +```json +{ + "status" : "OK", + "upgrades" : { + "roles" : { + "add" : [ "flow_framework_full_access" ] + } + } +} +``` + +#### Response fields + +| Field | Data type | Description | +|:---------|:-----------|:------------------------------| +| `upgrades` | Object | A container for the upgrade results, organized by configuration type, such as `roles`. Each changed configuration type will be represented as a key in this object. | +| `roles` | Object | Contains a list of role-based action keys of objects modified by the upgrade. | + --- ## Distinguished names From 5a873ab6acc664b5e741977465219ce9d4e822d8 Mon Sep 17 00:00:00 2001 From: Srikanth Govindarajan Date: Wed, 27 Mar 2024 10:16:46 -0700 Subject: [PATCH 09/18] Add split event processor for dataprepper (#6682) * Add split event processor for dataprepper Signed-off-by: srigovs * Adding usage example * Update split-event.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: srigovs Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../configuration/processors/split-event.md | 52 +++++++++++++++++++ 1 file changed, 52 insertions(+) create mode 100644 _data-prepper/pipelines/configuration/processors/split-event.md diff --git a/_data-prepper/pipelines/configuration/processors/split-event.md b/_data-prepper/pipelines/configuration/processors/split-event.md new file mode 100644 index 0000000000..2dbdaf0cc0 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/split-event.md @@ -0,0 +1,52 @@ +--- +layout: default +title: split-event +parent: Processors +grand_parent: Pipelines +nav_order: 56 +--- + +# split-event + +The `split-event` processor is used to split events based on a delimiter and generates multiple events from a user-specified field. + +## Configuration + +The following table describes the configuration options for the `split-event` processor. + +| Option | Type | Description | +|------------------|---------|-----------------------------------------------------------------------------------------------| +| `field` | String | The event field to be split. | +| `delimiter_regex`| String | The regular expression used as the delimiter for splitting the field. | +| `delimiter` | String | The delimiter used for splitting the field. If not specified, the default delimiter is used. 
| + +# Usage + +To use the `split-event` processor, add the following to your `pipelines.yaml` file: + +``` +split-event-pipeline: + source: + http: + processor: + - split_event: + field: query + delimiter: ' ' + sink: + - stdout: +``` +{% include copy.html %} + +When an event contains the following example input: + +``` +{"query" : "open source", "some_other_field" : "abc" } +``` + +The input will be split into multiple events based on the `query` field, with the delimiter set as white space, as shown in the following example: + +``` +{"query" : "open", "some_other_field" : "abc" } +{"query" : "source", "some_other_field" : "abc" } +``` + From 48651b0d84b812e119c761044a2f54c30f1f8674 Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Wed, 27 Mar 2024 15:33:29 -0500 Subject: [PATCH 10/18] Data Prepper 2.7 documentation (#6763) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Obfuscate processor doc (#6387) Signed-off-by: shaavanga * [Data Prepper] MAINT: document on disable_refresh secret extension setting (#6384) * MAINT: document on disable_refresh secret extension setting Signed-off-by: George Chen * Update _data-prepper/managing-data-prepper/configuring-data-prepper.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/managing-data-prepper/configuring-data-prepper.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: George Chen Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Updates to the S3 source documentation (#6379) * Updates to the S3 source documentation to include missing features and metrics. 
Signed-off-by: David Venable * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Try to fix build error * Update s3.md * Update s3.md * Update s3.md * Update s3.md * Update s3.md * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update s3.md * See if removing links removes build error * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: David Venable Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Update list_to_map processor in Data Prepper (#6382) * Update list-to-map processor Signed-off-by: Hai Yan * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Hai Yan Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Improvements to the S3 sink documenation (#6383) * Corrections and clarifications on the S3 sink. Include an IAM policy and an example Parquet schema. Signed-off-by: David Venable * Updates to the S3 sink to clarify how the object name is generated. Removes an option which does not exist. 
Signed-off-by: David Venable * Update s3.md * Update s3.md * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/pipelines/configuration/sinks/s3.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update s3.md * Add David's link Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: David Venable Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Add permissions, metrics, and metadata attributes to Data Prepper dyn… (#6380) * Add permissions, metrics, and metadata attributes to Data Prepper dynamodb source documentation Signed-off-by: Taylor Gray * Address PR comments Signed-off-by: Taylor Gray * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update dynamo-db.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Taylor Gray Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Date Processor doc update (#6381) * Date Processor doc update Signed-off-by: Asif Sohail Mohammed * Fixed table header indentation Signed-off-by: Asif Sohail Mohammed * Fix formatting and grammar. 
* Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update date.md --------- Signed-off-by: Asif Sohail Mohammed Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Add docs for join function in Data Prepper (#6688) * Add docs for join function Signed-off-by: Hai Yan * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update expression-syntax.md * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Hai Yan Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Add docs for flatten processor for Data Prepper (#6685) * Add docs for flatten processor Signed-off-by: Hai Yan * Add grammar edits * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update flatten.md * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Hai Yan Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Adds data prepper decompress processor documentation (#6683) * Add data prepper decompress processor documentation Signed-off-by: Taylor Gray * Update decompress.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update decompress.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Taylor Gray Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Add data prepper documentation for grok performance_metadata (#6681) * Add data prepper grok performance metadata documentation Signed-off-by: Taylor Gray * Update grok.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions 
from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update grok.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Taylor Gray Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Documentation for data prepper dynamodb source view_on_remove feature (#6738) * Add documentation for dynamodb source view_on_remove feature Signed-off-by: Taylor Gray * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Taylor Gray Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update Data Prepper opensearch sink documentation (#6386) * Update Data Prepper opensearch sink documentation Signed-off-by: Taylor Gray * Formatting fixes and adding introductory text. Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update opensearch.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update opensearch.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update opensearch.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update opensearch.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update opensearch.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Taylor Gray Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Adds a configuration for the Data Prepper S3 source workers field. (#6774) * Adds a configuration for the Data Prepper S3 source workers field. 
Signed-off-by: David Venable * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: David Venable Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Add parse_ion processor (#6761) * Add parse_ion processor Signed-off-by: Archer * Apply suggestions from code review Co-authored-by: David Venable Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update parse-ion.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Heather Halter Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Heather Halter Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Archer Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: David Venable Co-authored-by: Heather Halter Co-authored-by: Nathan Bower * Add docs for map_to_list processor (#6680) * Add docs for map_to_list processor Signed-off-by: Hai Yan * Update map-to-list.md * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: Hai Yan Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Adds documentation for the Data Prepper geoip processor and geoip_service extension (#6772) * Adds documentation for the Data Prepper geoip processor and geoip_service extension. 
Signed-off-by: David Venable * Update extensions.md * Update geoip_service.md * Update geoip.md * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _data-prepper/managing-data-prepper/extensions/extensions.md Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --------- Signed-off-by: David Venable Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower * Update flatten.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Fix nav order Signed-off-by: Archer --------- Signed-off-by: shaavanga Signed-off-by: George Chen Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: David Venable Signed-off-by: Hai Yan Signed-off-by: Taylor Gray Signed-off-by: Asif Sohail Mohammed Signed-off-by: Archer Co-authored-by: Prathyusha Vangala <157630736+shaavanga@users.noreply.github.com> Co-authored-by: Qi Chen Co-authored-by: David Venable Co-authored-by: Nathan Bower Co-authored-by: Hai Yan <8153134+oeyh@users.noreply.github.com> Co-authored-by: Taylor Gray Co-authored-by: Asif Sohail Mohammed Co-authored-by: Heather Halter --- .../configuring-data-prepper.md | 4 +- .../extensions/extensions.md | 15 + .../extensions/geoip_service.md | 67 +++++ .../configuration/processors/date.md | 54 +++- .../configuration/processors/decompress.md | 49 ++++ .../processors/delete-entries.md | 2 +- .../configuration/processors/dissect.md | 2 +- .../configuration/processors/drop-events.md | 2 +- .../configuration/processors/flatten.md | 239 +++++++++++++++ .../configuration/processors/geoip.md | 67 +++++ .../configuration/processors/grok.md | 63 ++-- .../configuration/processors/list-to-map.md | 52 +++- .../configuration/processors/map-to-list.md | 277 ++++++++++++++++++ .../configuration/processors/mutate-event.md | 1 + .../configuration/processors/obfuscate.md | 2 + .../configuration/processors/parse-ion.md | 56 ++++ .../configuration/processors/split-event.md | 2 +- .../configuration/sinks/opensearch.md | 152 +++++++--- .../pipelines/configuration/sinks/s3.md | 116 ++++++-- .../configuration/sources/dynamo-db.md | 103 ++++++- .../pipelines/configuration/sources/s3.md | 38 ++- _data-prepper/pipelines/expression-syntax.md | 20 +- 22 files changed, 1281 insertions(+), 102 deletions(-) create mode 100644 _data-prepper/managing-data-prepper/extensions/extensions.md create mode 100644 _data-prepper/managing-data-prepper/extensions/geoip_service.md create mode 100644 _data-prepper/pipelines/configuration/processors/decompress.md create mode 100644 _data-prepper/pipelines/configuration/processors/flatten.md create mode 100644 _data-prepper/pipelines/configuration/processors/geoip.md create mode 100644 _data-prepper/pipelines/configuration/processors/map-to-list.md create mode 100644 _data-prepper/pipelines/configuration/processors/parse-ion.md diff --git a/_data-prepper/managing-data-prepper/configuring-data-prepper.md b/_data-prepper/managing-data-prepper/configuring-data-prepper.md index bcff65ed4c..d6750daba4 100644 --- a/_data-prepper/managing-data-prepper/configuring-data-prepper.md +++ b/_data-prepper/managing-data-prepper/configuring-data-prepper.md @@ 
-128,6 +128,7 @@ extensions: region: sts_role_arn: refresh_interval: + disable_refresh: false : ... ``` @@ -148,7 +149,8 @@ Option | Required | Type | Description secret_id | Yes | String | The AWS secret name or ARN. | region | No | String | The AWS region of the secret. Defaults to `us-east-1`. sts_role_arn | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to the AWS Secrets Manager. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). -refresh_interval | No | Duration | The refreshment interval for AWS secrets extension plugin to poll new secret values. Defaults to `PT1H`. See [Automatically refreshing secrets](#automatically-refreshing-secrets) for details. +refresh_interval | No | Duration | The refreshment interval for the AWS Secrets extension plugin to poll new secret values. Defaults to `PT1H`. For more information, see [Automatically refreshing secrets](#automatically-refreshing-secrets). +disable_refresh | No | Boolean | Disables regular polling for the latest secret values inside the AWS secrets extension plugin. Defaults to `false`. When set to `true`, `refresh_interval` will not be used. #### Reference secrets diff --git a/_data-prepper/managing-data-prepper/extensions/extensions.md b/_data-prepper/managing-data-prepper/extensions/extensions.md new file mode 100644 index 0000000000..8cbfc602c7 --- /dev/null +++ b/_data-prepper/managing-data-prepper/extensions/extensions.md @@ -0,0 +1,15 @@ +--- +layout: default +title: Extensions +parent: Managing Data Prepper +has_children: true +nav_order: 18 +--- + +# Extensions + +Data Prepper extensions provide Data Prepper functionality outside of core Data Prepper pipeline components. +Many extensions provide configuration options that give Data Prepper administrators greater flexibility over Data Prepper's functionality. + +Extensions are configured in the `data-prepper-config.yaml` file under the `extensions:` YAML block. + diff --git a/_data-prepper/managing-data-prepper/extensions/geoip_service.md b/_data-prepper/managing-data-prepper/extensions/geoip_service.md new file mode 100644 index 0000000000..53c21a08ff --- /dev/null +++ b/_data-prepper/managing-data-prepper/extensions/geoip_service.md @@ -0,0 +1,67 @@ +--- +layout: default +title: geoip_service +nav_order: 5 +parent: Extensions +grand_parent: Managing Data Prepper +--- + +# geoip_service + +The `geoip_service` extension configures all [`geoip`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/geoip) processors in Data Prepper. + +## Usage + +You can configure the GeoIP service that Data Prepper uses for the `geoip` processor. +By default, the GeoIP service comes with the [`maxmind`](#maxmind) option configured. + +The following example shows how to configure the `geoip_service` in the `data-prepper-config.yaml` file: + +``` +extensions: + geoip_service: + maxmind: + database_refresh_interval: PT1H + cache_count: 16_384 +``` + +## maxmind + +The GeoIP service supports the MaxMind [GeoIP and GeoLite](https://dev.maxmind.com/geoip) databases. +By default, Data Prepper will use all three of the following [MaxMind GeoLite2](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data) databases: + +* City +* Country +* ASN + +The service also downloads databases automatically to keep Data Prepper up to date with changes from MaxMind. 
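The `databases` option described in the tables that follow can also point the `maxmind` extension at self-managed database files rather than the default MaxMind downloads. The following is a minimal sketch, not a definitive configuration; the URLs, bucket, and destination directory are hypothetical placeholders:

```
extensions:
  geoip_service:
    maxmind:
      database_refresh_interval: PT7D
      database_destination: /usr/share/data-prepper/geoip
      databases:
        # Hypothetical locations; each value can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL.
        city: https://geoip.example.com/GeoLite2-City.mmdb
        asn: s3://example-geoip-bucket/GeoLite2-ASN.mmdb
```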
+ +You can use the following options to configure the `maxmind` extension. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`databases` | No | [database](#database) | The database configuration. +`database_refresh_interval` | No | Duration | How frequently to check for updates from MaxMind. This can be any duration in the range of 15 minutes to 30 days. Default is `PT7D`. +`cache_count` | No | Integer | The maximum number of items to store in the cache, with a range of 100--100,000. Default is `4096`. +`database_destination` | No | String | The name of the directory in which to store downloaded databases. Default is `{data-prepper.dir}/data/geoip`. +`aws` | No | [aws](#aws) | Configures the AWS credentials for downloading the database from Amazon Simple Storage Service (Amazon S3). +`insecure` | No | Boolean | When `true`, this option allows you to download database files over HTTP. Default is `false`. + +## database + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`city` | No | String | The URL of the city database file. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL. +`country` | No | String | The URL of the country database file. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL. +`asn` | No | String | The URL of the Autonomous System Number (ASN) database file. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL. +`enterprise` | No | String | The URL of the enterprise database file. Can be an HTTP URL for a manifest file, an MMDB file, or an S3 URL. + + +## aws + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`region` | No | String | The AWS Region to use for the credentials. Default is the [standard SDK behavior for determining the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). +`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon S3. Default is `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). +`aws_sts_header_overrides` | No | Map | A map of header overrides that the AWS Identity and Access Management (IAM) role assumes when downloading from Amazon S3. +`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the STS role. For more information, see the `ExternalID` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference. diff --git a/_data-prepper/pipelines/configuration/processors/date.md b/_data-prepper/pipelines/configuration/processors/date.md index 27b571df04..7ac1040c26 100644 --- a/_data-prepper/pipelines/configuration/processors/date.md +++ b/_data-prepper/pipelines/configuration/processors/date.md @@ -9,24 +9,32 @@ nav_order: 50 # date -The `date` processor adds a default timestamp to an event, parses timestamp fields, and converts timestamp information to the International Organization for Standardization (ISO) 8601 format. This timestamp information can be used as an event timestamp. +The `date` processor adds a default timestamp to an event, parses timestamp fields, and converts timestamp information to the International Organization for Standardization (ISO) 8601 format. This timestamp information can be used as an event timestamp. 
## Configuration The following table describes the options you can use to configure the `date` processor. + Option | Required | Type | Description :--- | :--- | :--- | :--- -match | Conditionally | List | List of `key` and `patterns` where patterns is a list. The list of match can have exactly one `key` and `patterns`. There is no default value. This option cannot be defined at the same time as `from_time_received`. Include multiple date processors in your pipeline if both options should be used. -from_time_received | Conditionally | Boolean | A boolean that is used for adding default timestamp to event data from event metadata which is the time when source receives the event. Default value is `false`. This option cannot be defined at the same time as `match`. Include multiple date processors in your pipeline if both options should be used. -destination | No | String | Field to store the timestamp parsed by date processor. It can be used with both `match` and `from_time_received`. Default value is `@timestamp`. -source_timezone | No | String | Time zone used to parse dates. It is used in case the zone or offset cannot be extracted from the value. If the zone or offset are part of the value, then timezone is ignored. Find all the available timezones [the list of database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List) in the **TZ database name** column. -destination_timezone | No | String | Timezone used for storing timestamp in `destination` field. The available timezone values are the same as `source_timestamp`. -locale | No | String | Locale is used for parsing dates. It's commonly used for parsing month names(`MMM`). It can have language, country and variant fields using IETF BCP 47 or String representation of [Locale](https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html) object. For example `en-US` for IETF BCP 47 and `en_US` for string representation of Locale. Full list of locale fields which includes language, country and variant can be found [the language subtag registry](https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry). Default value is `Locale.ROOT`. +`match` | Conditionally | [Match](#Match) | The date match configuration. This option cannot be defined at the same time as `from_time_received`. There is no default value. +`from_time_received` | Conditionally | Boolean | When `true`, the timestamp from the event metadata, which is the time at which the source receives the event, is added to the event data. This option cannot be defined at the same time as `match`. Default is `false`. +`date_when` | No | String | Specifies under what condition the `date` processor should perform matching. Default is no condition. +`to_origination_metadata` | No | Boolean | When `true`, the matched time is also added to the event's metadata as an instance of `Instant`. Default is `false`. +`destination` | No | String | The field used to store the timestamp parsed by the date processor. Can be used with both `match` and `from_time_received`. Default is `@timestamp`. +`output_format` | No | String | Determines the format of the timestamp added to an event. Default is `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`. +`source_timezone` | No | String | The time zone used to parse dates, including when the zone or offset cannot be extracted from the value. If the zone or offset are part of the value, then the time zone is ignored. 
A list of all the available time zones is contained in the **TZ database name** column of [the list of database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List). +`destination_timezone` | No | String | The time zone used for storing the timestamp in the `destination` field. A list of all the available time zones is contained in the **TZ database name** column of [the list of database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List). +`locale` | No | String | The location used for parsing dates. Commonly used for parsing month names (`MMM`). The value can contain language, country, or variant fields in IETF BCP 47, such as `en-US`, or a string representation of the [locale](https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html) object, such as `en_US`. A full list of locale fields, including language, country, and variant, can be found in [the language subtag registry](https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry). Default is `Locale.ROOT`. + - +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`key` | Yes | String | Represents the event key against which to match patterns. Required if `match` is configured. +`patterns` | Yes | List | A list of possible patterns that the timestamp value of the key can have. The patterns are based on a sequence of letters and symbols. The `patterns` support all the patterns listed in the Java [DatetimeFormatter](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html) reference. The timestamp value also supports `epoch_second`, `epoch_milli`, and `epoch_nano` values, which represent the timestamp as the number of seconds, milliseconds, and nanoseconds since the epoch. Epoch values always use the UTC time zone. ## Metrics @@ -40,5 +48,29 @@ The following table describes common [Abstract processor](https://github.com/ope The `date` processor includes the following custom metrics. -* `dateProcessingMatchSuccessCounter`: Returns the number of records that match with at least one pattern specified by the `match configuration` option. -* `dateProcessingMatchFailureCounter`: Returns the number of records that did not match any of the patterns specified by the `patterns match` configuration option. \ No newline at end of file +* `dateProcessingMatchSuccessCounter`: Returns the number of records that match at least one pattern specified by the `match configuration` option. +* `dateProcessingMatchFailureCounter`: Returns the number of records that did not match any of the patterns specified by the `patterns match` configuration option. 
+ +## Example: Add the default timestamp to an event +The following `date` processor configuration can be used to add a default timestamp to the `@timestamp` field for all events: + +```yaml +- date: + from_time_received: true + destination: "@timestamp" +``` + +## Example: Parse a timestamp to convert its format and time zone +The following `date` processor configuration can be used to parse a timestamp value in the `dd/MMM/yyyy:HH:mm:ss` format and write it in the `yyyy-MM-dd'T'HH:mm:ss.SSSXXX` format: + +```yaml +- date: + match: + - key: timestamp + patterns: ["dd/MMM/yyyy:HH:mm:ss"] + destination: "@timestamp" + output_format: "yyyy-MM-dd'T'HH:mm:ss.SSSXXX" + source_timezone: "America/Los_Angeles" + destination_timezone: "America/Chicago" + locale: "en_US" +``` diff --git a/_data-prepper/pipelines/configuration/processors/decompress.md b/_data-prepper/pipelines/configuration/processors/decompress.md new file mode 100644 index 0000000000..d03c236ac5 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/decompress.md @@ -0,0 +1,49 @@ +--- +layout: default +title: decompress +parent: Processors +grand_parent: Pipelines +nav_order: 40 +--- + +# decompress + +The `decompress` processor decompresses any Base64-encoded compressed fields inside of an event. + +## Configuration + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`keys` | Yes | List | The fields in the event that will be decompressed. +`type` | Yes | Enum | The type of decompression to use for the `keys` in the event. Only `gzip` is supported. +`decompress_when` | No | String | A [Data Prepper conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/) that determines when the `decompress` processor will run on certain events. +`tags_on_failure` | No | List | A list of strings with which to tag events when the processor fails to decompress the `keys` inside an event. Defaults to `_decompression_failure`. + +## Usage + +The following example shows the `decompress` processor used in `pipelines.yaml`: + +```yaml +processor: + - decompress: + decompress_when: '/some_key == null' + keys: [ "base_64_gzip_key" ] + type: gzip +``` + +## Metrics + +The following table describes common [abstract processor](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/processor/AbstractProcessor.java) metrics. + +| Metric name | Type | Description | +| ------------- | ---- | -----------| +| `recordsIn` | Counter | The ingress of records to a pipeline component. | +| `recordsOut` | Counter | The egress of records from a pipeline component. | +| `timeElapsed` | Timer | The time elapsed during execution of a pipeline component. | + +### Counter + +The `decompress` processor accounts for the following metrics: + +* `processingErrors`: The number of processing errors that have occurred in the `decompress` processor. 
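The decompressed value of each key remains a string, so a follow-up processor is typically needed to turn it into structured fields. The following is a minimal sketch, assuming a hypothetical `base_64_gzip_key` field that holds a Base64-encoded gzip payload containing JSON; the payload is decompressed and then parsed with the separate `parse_json` processor:

```yaml
processor:
  - decompress:
      # Decompress the Base64-encoded gzip payload in place.
      keys: [ "base_64_gzip_key" ]
      type: gzip
  - parse_json:
      # Parse the decompressed JSON string into event fields.
      source: "base_64_gzip_key"
```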
+ diff --git a/_data-prepper/pipelines/configuration/processors/delete-entries.md b/_data-prepper/pipelines/configuration/processors/delete-entries.md index 0546ed67c4..33c54a0b29 100644 --- a/_data-prepper/pipelines/configuration/processors/delete-entries.md +++ b/_data-prepper/pipelines/configuration/processors/delete-entries.md @@ -3,7 +3,7 @@ layout: default title: delete_entries parent: Processors grand_parent: Pipelines -nav_order: 51 +nav_order: 41 --- # delete_entries diff --git a/_data-prepper/pipelines/configuration/processors/dissect.md b/_data-prepper/pipelines/configuration/processors/dissect.md index 2d32ba47ae..a8258bee4e 100644 --- a/_data-prepper/pipelines/configuration/processors/dissect.md +++ b/_data-prepper/pipelines/configuration/processors/dissect.md @@ -3,7 +3,7 @@ layout: default title: dissect parent: Processors grand_parent: Pipelines -nav_order: 52 +nav_order: 45 --- # dissect diff --git a/_data-prepper/pipelines/configuration/processors/drop-events.md b/_data-prepper/pipelines/configuration/processors/drop-events.md index d030f14a27..1f601c9743 100644 --- a/_data-prepper/pipelines/configuration/processors/drop-events.md +++ b/_data-prepper/pipelines/configuration/processors/drop-events.md @@ -3,7 +3,7 @@ layout: default title: drop_events parent: Processors grand_parent: Pipelines -nav_order: 53 +nav_order: 46 --- # drop_events diff --git a/_data-prepper/pipelines/configuration/processors/flatten.md b/_data-prepper/pipelines/configuration/processors/flatten.md new file mode 100644 index 0000000000..43793c2b83 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/flatten.md @@ -0,0 +1,239 @@ +--- +layout: default +title: flatten +parent: Processors +grand_parent: Pipelines +nav_order: 48 +--- + +# flatten + +The `flatten` processor transforms nested objects inside of events into flattened structures. + +## Configuration + +The following table describes configuration options for the `flatten` processor. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`source` | Yes | String | The source key on which to perform the operation. If set to an empty string (`""`), then the processor uses the root of the event as the source. +`target` | Yes | String | The target key in which to put the flattened fields. If set to an empty string (`""`), then the processor uses the root of the event as the target. +`exclude_keys` | No | List | The keys from the source field that should be excluded from processing. Default is an empty list (`[]`). +`remove_processed_fields` | No | Boolean | When `true`, the processor removes all processed fields from the source. Default is `false`. +`remove_list_indices` | No | Boolean | When `true`, the processor converts the fields from the source map into lists and puts the lists into the target field. Default is `false`. +`flatten_when` | No | String | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"`, that determines whether the `flatten` processor will be run on the event. Default is `null`, which means that all events will be processed. +`tags_on_failure` | No | List | A list of tags to add to the event metadata when the event fails to process. + +## Usage + +The following examples show how the `flatten` processor can be used in Data Prepper pipelines. 
+ +### Minimum configuration + +The following example shows only the parameters that are required for using the `flatten` processor, `source` and `target`: + +```yaml +... + processor: + - flatten: + source: "key2" + target: "flattened-key2" +... +``` +{% include copy.html %} + +For example, when the input event contains the following nested objects: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + } +} +``` + +The `flatten` processor creates a flattened structure under the `flattened-key2` object, as shown in the following output: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "flattened-key2": { + "key3.key4": "val2" + } +} +``` + +### Remove processed fields + +Use the `remove_processed_fields` option when flattening all of an event's nested objects. This removes all the event's processed fields, as shown in the following example: + +```yaml +... + processor: + - flatten: + source: "" # empty string represents root of event + target: "" # empty string represents root of event + remove_processed_fields: true +... +``` + +For example, when the input event contains the following nested objects: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "list1": [ + { + "list2": [ + { + "name": "name1", + "value": "value1" + }, + { + "name": "name2", + "value": "value2" + } + ] + } + ] +} +``` + + +The `flatten` processor creates a flattened structure in which all processed fields are absent, as shown in the following output: + +```json +{ + "key1": "val1", + "key2.key3.key4": "val2", + "list1[0].list2[0].name": "name1", + "list1[0].list2[0].value": "value1", + "list1[0].list2[1].name": "name2", + "list1[0].list2[1].value": "value2", +} +``` + +### Exclude specific keys from flattening + +Use the `exclude_keys` option to prevent specific keys from being flattened in the output, as shown in the following example, where the `key2` value is excluded: + +```yaml +... + processor: + - flatten: + source: "" # empty string represents root of event + target: "" # empty string represents root of event + remove_processed_fields: true + exclude_keys: ["key2"] +... +``` + +For example, when the input event contains the following nested objects: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "list1": [ + { + "list2": [ + { + "name": "name1", + "value": "value1" + }, + { + "name": "name2", + "value": "value2" + } + ] + } + ] +} +``` + +All other nested objects in the input event, excluding the `key2` key, will be flattened, as shown in the following example: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "list1[0].list2[0].name": "name1", + "list1[0].list2[0].value": "value1", + "list1[0].list2[1].name": "name2", + "list1[0].list2[1].value": "value2", +} +``` + +### Remove list indexes + +Use the `remove_list_indices` option to convert the fields from the source map into lists and put the lists into the target field, as shown in the following example: + +```yaml +... + processor: + - flatten: + source: "" # empty string represents root of event + target: "" # empty string represents root of event + remove_processed_fields: true + remove_list_indices: true +... 
+``` + +For example, when the input event contains the following nested objects: + +```json +{ + "key1": "val1", + "key2": { + "key3": { + "key4": "val2" + } + }, + "list1": [ + { + "list2": [ + { + "name": "name1", + "value": "value1" + }, + { + "name": "name2", + "value": "value2" + } + ] + } + ] +} +``` + +The processor removes the list indexes from the flattened keys and gathers the corresponding values into lists, as shown in the following example: + +```json +{ + "key1": "val1", + "key2.key3.key4": "val2", + "list1[].list2[].name": ["name1","name2"], + "list1[].list2[].value": ["value1","value2"] +} +``` diff --git a/_data-prepper/pipelines/configuration/processors/geoip.md b/_data-prepper/pipelines/configuration/processors/geoip.md new file mode 100644 index 0000000000..b7418c66c6 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/geoip.md @@ -0,0 +1,67 @@ +--- +layout: default +title: geoip +parent: Processors +grand_parent: Pipelines +nav_order: 49 +--- + +# geoip + +The `geoip` processor enriches events with geographic information extracted from IP addresses contained in the events. +By default, Data Prepper uses the [MaxMind GeoLite2](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data) geolocation database. +Data Prepper administrators can configure the databases using the [`geoip_service`]({{site.url}}{{site.baseurl}}/data-prepper/managing-data-prepper/extensions/geoip_service) extension configuration. + +## Usage + +You can configure the `geoip` processor to work on entries. + +The minimal configuration requires at least one entry, and each entry requires at least one source field. + +The following configuration extracts all available geolocation data from the IP address provided in the field named `clientip`. +It will write the geolocation data to a new field named `geo`, the default target when none is configured: + +``` +my-pipeline: + processor: + - geoip: + entries: + - source: clientip +``` + +The following example includes only the Autonomous System Number (ASN) fields and puts the geolocation data into a field named `clientlocation`: + +``` +my-pipeline: + processor: + - geoip: + entries: + - source: clientip + target: clientlocation + include_fields: [asn, asn_organization, network] +``` + + +## Configuration + +You can use the following options to configure the `geoip` processor. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`entries` | Yes | [entry](#entry) list | The list of entries marked for enrichment. +`geoip_when` | No | String | Specifies under what condition the `geoip` processor should perform matching. Default is no condition. +`tags_on_no_valid_ip` | No | String | The tags to add to the event metadata if the source field is not a valid IP address. This includes the localhost IP address. +`tags_on_ip_not_found` | No | String | The tags to add to the event metadata if the `geoip` processor is unable to find a location for the IP address. +`tags_on_engine_failure` | No | String | The tags to add to the event metadata if the `geoip` processor is unable to enrich an event due to an engine failure. + +## entry + +The following parameters allow you to configure a single geolocation entry. Each entry corresponds to a single IP address. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`source` | Yes | String | The key of the source field containing the IP address to geolocate. +`target` | No | String | The key of the target field in which to save the geolocation data. Default is `geo`. 
+`include_fields` | No | String list | The list of geolocation fields to include in the `target` object. By default, this is all the fields provided by the configured databases. +`exclude_fields` | No | String list | The list of geolocation fields to exclude from the `target` object. + diff --git a/_data-prepper/pipelines/configuration/processors/grok.md b/_data-prepper/pipelines/configuration/processors/grok.md index d1eea278d2..16f72c4968 100644 --- a/_data-prepper/pipelines/configuration/processors/grok.md +++ b/_data-prepper/pipelines/configuration/processors/grok.md @@ -3,7 +3,7 @@ layout: default title: Grok parent: Processors grand_parent: Pipelines -nav_order: 54 +nav_order: 50 --- # Grok @@ -15,26 +15,25 @@ The Grok processor uses pattern matching to structure and extract important keys The following table describes options you can use with the Grok processor to structure your data and make your data easier to query. Option | Required | Type | Description -:--- | :--- | :--- | :--- -break_on_match | No | Boolean | Specifies whether to match all patterns or stop once the first successful match is found. Default value is `true`. -grok_when | No | String | Specifies under what condition the `Grok` processor should perform matching. Default is no condition. -keep_empty_captures | No | Boolean | Enables the preservation of `null` captures. Default value is `false`. -keys_to_overwrite | No | List | Specifies which existing keys will be overwritten if there is a capture with the same key value. Default value is `[]`. -match | No | Map | Specifies which keys to match specific patterns against. Default value is an empty body. -named_captures_only | No | Boolean | Specifies whether to keep only named captures. Default value is `true`. -pattern_definitions | No | Map | Allows for custom pattern use inline. Default value is an empty body. -patterns_directories | No | List | Specifies the path of directories that contain customer pattern files. Default value is an empty list. -pattern_files_glob | No | String | Specifies which pattern files to use from the directories specified for `pattern_directories`. Default value is `*`. -target_key | No | String | Specifies a parent-level key used to store all captures. Default value is `null`. -timeout_millis | No | Integer | The maximum amount of time during which matching occurs. Setting to `0` disables the timeout. Default value is `30,000`. - - +:--- | :--- |:--- | :--- +`break_on_match` | No | Boolean | Specifies whether to match all patterns (`true`) or stop once the first successful match is found (`false`). Default is `true`. +`grok_when` | No | String | Specifies under what condition the `grok` processor should perform matching. Default is no condition. +`keep_empty_captures` | No | Boolean | Enables the preservation of `null` captures from the processed output. Default is `false`. +`keys_to_overwrite` | No | List | Specifies which existing keys will be overwritten if there is a capture with the same key value. Default is `[]`. +`match` | No | Map | Specifies which keys should match specific patterns. Default is an empty response body. +`named_captures_only` | No | Boolean | Specifies whether to keep only named captures. Default is `true`. +`pattern_definitions` | No | Map | Allows for a custom pattern that can be used inline inside the response body. Default is an empty response body. +`patterns_directories` | No | List | Specifies which directory paths contain the custom pattern files. Default is an empty list. 
+`pattern_files_glob` | No | String | Specifies which pattern files to use from the directories specified for `patterns_directories`. Default is `*`. +`target_key` | No | String | Specifies a parent-level key used to store all captures. Default is `null`. +`timeout_millis` | No | Integer | The maximum amount of time during which matching occurs. Setting to `0` disables the timeout. Default is `30,000`. +`performance_metadata` | No | Boolean | Whether or not to add the performance metadata to events. Default is `false`. For more information, see [Grok performance metadata](#grok-performance-metadata). + ## Conditional grok -The Grok processor can be configured to run conditionally by using the `grok_when` option. The following is an example Grok processor configuration that uses `grok_when`: +The `grok` processor can be configured to run conditionally by using the `grok_when` option. The following is an example Grok processor configuration that uses `grok_when`: + ``` processor: - grok: @@ -46,8 +45,36 @@ processor: match: message: ['%{IPV6:clientip} %{WORD:request} %{POSINT:bytes}'] ``` +{% include copy.html %} + The `grok_when` option can take a conditional expression. This expression is detailed in the [Expression syntax](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/) documentation. +## Grok performance metadata + +When the `performance_metadata` option is set to `true`, the `grok` processor adds the following metadata keys to each event: + +* `_total_grok_processing_time`: The total amount of time, in milliseconds, that the `grok` processor takes to match the event. This is the sum of the processing time based on all of the `grok` processors that ran on the event and have the `performance_metadata` option enabled. +* `_total_grok_patterns_attempted`: The total number of `grok` pattern match attempts across all `grok` processors that ran on the event. + +To include Grok performance metadata when the event is sent to the sink inside the pipeline, use the `add_entries` processor to describe the metadata you want to include, as shown in the following example: + + +```yaml +processor: + - grok: + performance_metadata: true + match: + log: "%{COMMONAPACHELOG}" + - add_entries: + entries: + - add_when: 'getMetadata("_total_grok_patterns_attempted") != null' + key: "grok_patterns_attempted" + value_expression: 'getMetadata("_total_grok_patterns_attempted")' + - add_when: 'getMetadata("_total_grok_processing_time") != null' + key: "grok_time_spent" + value_expression: 'getMetadata("_total_grok_processing_time")' +``` + ## Metrics The following table describes common [Abstract processor](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/processor/AbstractProcessor.java) metrics. diff --git a/_data-prepper/pipelines/configuration/processors/list-to-map.md b/_data-prepper/pipelines/configuration/processors/list-to-map.md index 4b137f5ce8..15a90ffc24 100644 --- a/_data-prepper/pipelines/configuration/processors/list-to-map.md +++ b/_data-prepper/pipelines/configuration/processors/list-to-map.md @@ -16,10 +16,12 @@ The following table describes the configuration options used to generate target Option | Required | Type | Description :--- | :--- | :--- | :--- -`key` | Yes | String | The key of the fields to be extracted as keys in the generated mappings. `source` | Yes | String | The list of objects with `key` fields to be converted into keys for the generated map. 
`target` | No | String | The target for the generated map. When not specified, the generated map will be placed in the root node. +`key` | Conditionally | String | The key of the fields to be extracted as keys in the generated mappings. Must be specified if `use_source_key` is `false`. +`use_source_key` | No | Boolean | When `true`, keys in the generated map will use original keys from the source. Default is `false`. `value_key` | No | String | When specified, values given a `value_key` in objects contained in the source list will be extracted and converted into the value specified by this option based on the generated map. When not specified, objects contained in the source list retain their original value when mapped. +`extract_value` | No | Boolean | When `true`, object values from the source list will be extracted and added to the generated map. When `false`, object values from the source list are added to the generated map as they appear in the source list. Default is `false`. `flatten` | No | Boolean | When `true`, values in the generated map output flatten into single items based on the `flattened_element`. Otherwise, objects mapped to values from the generated map appear as lists. `flattened_element` | Conditionally | String | The element to keep, either `first` or `last`, when `flatten` is set to `true`. @@ -302,4 +304,52 @@ Some objects in the response may have more than one element in their values, as "val-c" ] } +``` + +### Example: `use_source_key` and `extract_value` set to `true` + +The following example `pipeline.yaml` file sets `use_source_key` and `extract_value` to `true`, causing the processor to reuse the original source keys and to extract the object values from the source list: + +```yaml +pipeline: + source: + file: + path: "/full/path/to/logs_json.log" + record_type: "event" + format: "json" + processor: + - list_to_map: + source: "mylist" + use_source_key: true + extract_value: true + sink: + - stdout: +``` +{% include copy.html %} + +Object values from `mylist` are extracted and added to fields with the source keys `name` and `value`, as shown in the following response: + +```json +{ + "mylist": [ + { + "name": "a", + "value": "val-a" + }, + { + "name": "b", + "value": "val-b1" + }, + { + "name": "b", + "value": "val-b2" + }, + { + "name": "c", + "value": "val-c" + } + ], + "name": ["a", "b", "b", "c"], + "value": ["val-a", "val-b1", "val-b2", "val-c"] +} ``` \ No newline at end of file diff --git a/_data-prepper/pipelines/configuration/processors/map-to-list.md b/_data-prepper/pipelines/configuration/processors/map-to-list.md new file mode 100644 index 0000000000..f3393e6c46 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/map-to-list.md @@ -0,0 +1,277 @@ +--- +layout: default +title: map_to_list +parent: Processors +grand_parent: Pipelines +nav_order: 63 +--- + +# map_to_list + +The `map_to_list` processor converts a map of key-value pairs to a list of objects. Each object contains the key and value in separate fields. + +## Configuration + +The following table describes the configuration options for the `map_to_list` processor. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +`source` | Yes | String | The source map used to perform the mapping operation. When set to an empty string (`""`), it will use the root of the event as the `source`. +`target` | Yes | String | The target for the generated list. +`key_name` | No | String | The name of the field in which to store the original key. Default is `key`. 
+`value_name` | No | String | The name of the field in which to store the original value. Default is `value`. +`exclude_keys` | No | List | The keys in the source map that will be excluded from processing. Default is an empty list (`[]`). +`remove_processed_fields` | No | Boolean | When `true`, the processor will remove the processed fields from the source map. Default is `false`. +`convert_field_to_list` | No | Boolean | If `true`, the processor will convert the fields from the source map into lists and place them in fields in the target list. Default is `false`. +`map_to_list_when` | No | String | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. Default is `null`. All events will be processed unless otherwise stated. +`tags_on_failure` | No | List | A list of tags to add to the event metadata when the event fails to process. + +## Usage + +The following examples show how the `map_to_list` processor can be used in your pipeline. + +### Example: Minimum configuration + +The following example shows the `map_to_list` processor with only the required parameters, `source` and `target`, configured: + +```yaml +... + processor: + - map_to_list: + source: "my-map" + target: "my-list" +... +``` +{% include copy.html %} + +When the input event contains the following data: + +```json +{ + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + + +The processed event will contain the following output: + +```json +{ + "my-list": [ + { + "key": "key1", + "value": "value1" + }, + { + "key": "key2", + "value": "value2" + }, + { + "key": "key3", + "value": "value3" + } + ], + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +### Example: Custom key name and value name + +The following example shows how to configure a custom key name and value name: + +```yaml +... + processor: + - map_to_list: + source: "my-map" + target: "my-list" + key_name: "name" + value_name: "data" +... +``` +{% include copy.html %} + +When the input event contains the following data: + +```json +{ + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +The processed event will contain the following output: + +```json +{ + "my-list": [ + { + "name": "key1", + "data": "value1" + }, + { + "name": "key2", + "data": "value2" + }, + { + "name": "key3", + "data": "value3" + } + ], + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +### Example: Exclude specific keys from processing and remove any processed fields + +The following example shows how to exclude specific keys and remove any processed fields from the output: + +```yaml +... + processor: + - map_to_list: + source: "my-map" + target: "my-list" + exclude_keys: ["key1"] + remove_processed_fields: true +... 
+``` +{% include copy.html %} + +When the input event contains the following data: +```json +{ + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +The processed event will remove the "key2" and "key3" fields, but the "my-map" object, "key1", will remain, as shown in the following output: + +```json +{ + "my-list": [ + { + "key": "key2", + "value": "value2" + }, + { + "key": "key3", + "value": "value3" + } + ], + "my-map": { + "key1": "value1" + } +} +``` + +### Example: Use convert_field_to_list + +The following example shows how to use the `convert_field_to_list` option in the processor: + +```yaml +... + processor: + - map_to_list: + source: "my-map" + target: "my-list" + convert_field_to_list: true +... +``` +{% include copy.html %} + +When the input event contains the following data: + +```json +{ + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +The processed event will convert all fields into lists, as shown in the following output: + +```json +{ + "my-list": [ + ["key1", "value1"], + ["key2", "value2"], + ["key3", "value3"] + ], + "my-map": { + "key1": "value1", + "key2": "value2", + "key3": "value3" + } +} +``` + +### Example: Use the event root as the source + +The following example shows how you can use an event's root as the source by setting the `source` setting to an empty string (`""`): + +```yaml +... + processor: + - map_to_list: + source: "" + target: "my-list" +... +``` +{% include copy.html %} + +When the input event contains the following data: + +```json +{ + "key1": "value1", + "key2": "value2", + "key3": "value3" +} +``` + +The processed event will contain the following output: + +```json +{ + "my-list": [ + { + "key": "key1", + "value": "value1" + }, + { + "key": "key2", + "value": "value2" + }, + { + "key": "key3", + "value": "value3" + } + ], + "key1": "value1", + "key2": "value2", + "key3": "value3" +} +``` diff --git a/_data-prepper/pipelines/configuration/processors/mutate-event.md b/_data-prepper/pipelines/configuration/processors/mutate-event.md index 8ce9f7d921..9b3b2afb33 100644 --- a/_data-prepper/pipelines/configuration/processors/mutate-event.md +++ b/_data-prepper/pipelines/configuration/processors/mutate-event.md @@ -21,3 +21,4 @@ Mutate event processors allow you to modify events in Data Prepper. The followin + diff --git a/_data-prepper/pipelines/configuration/processors/obfuscate.md b/_data-prepper/pipelines/configuration/processors/obfuscate.md index 4c33d8baab..13d906acb3 100644 --- a/_data-prepper/pipelines/configuration/processors/obfuscate.md +++ b/_data-prepper/pipelines/configuration/processors/obfuscate.md @@ -67,6 +67,8 @@ Use the following configuration options with the `obfuscate` processor. | `source` | Yes | The source field to obfuscate. | | `target` | No | The new field in which to store the obfuscated value. This leaves the original source field unchanged. When no `target` is provided, the source field updates with the obfuscated value. | | `patterns` | No | A list of regex patterns that allow you to obfuscate specific parts of a field. Only parts that match the regex pattern will obfuscate. When not provided, the processor obfuscates the whole field. | +| `obfuscate_when` | No | Specifies under what condition the Obfuscate processor should perform matching. Default is no condition. | +| `tags_on_match_failure` | No | The tag to add to an event if the obfuscate processor fails to match the pattern. | | `action` | No | The obfuscation action. 
As of Data Prepper 2.3, only the `mask` action is supported. | You can customize the `mask` action with the following optional configuration options. diff --git a/_data-prepper/pipelines/configuration/processors/parse-ion.md b/_data-prepper/pipelines/configuration/processors/parse-ion.md new file mode 100644 index 0000000000..0edd446c42 --- /dev/null +++ b/_data-prepper/pipelines/configuration/processors/parse-ion.md @@ -0,0 +1,56 @@ +--- +layout: default +title: parse_ion +parent: Processors +grand_parent: Pipelines +nav_order: 79 +--- + +# parse_ion + +The `parse_ion` processor parses [Amazon Ion](https://amazon-ion.github.io/ion-docs/) data. + +## Configuration + +You can configure the `parse_ion` processor with the following options. + +| Option | Required | Type | Description | +| :--- | :--- | :--- | :--- | +| `source` | No | String | The field in the `event` that is parsed. Default value is `message`. | +| `destination` | No | String | The destination field of the parsed JSON. Defaults to the root of the `event`. Cannot be `""`, `/`, or any white-space-only `string` because these are not valid `event` fields. | +| `pointer` | No | String | A JSON pointer to the field to be parsed. There is no `pointer` by default, meaning that the entire `source` is parsed. The `pointer` can access JSON array indexes as well. If the JSON pointer is invalid, then the entire `source` data is parsed into the outgoing `event`. If the key that is pointed to already exists in the `event` and the `destination` is the root, then the pointer uses the entire path of the key. | +| `tags_on_failure` | No | String | A list of strings that specify the tags to be set in the event that the processor fails or an unknown exception occurs while parsing. | + +## Usage + +The following examples show how to use the `parse_ion` processor in your pipeline. + +### Example: Minimum configuration + +The following example shows the minimum configuration for the `parse_ion` processor: + +```yaml +parse-ion-pipeline: + source: + stdin: + processor: + - parse_ion: + source: "my_ion" + sink: + - stdout: +``` +{% include copy.html %} + +When the input event contains the following data: + +``` +{"my_ion": "{ion_value1: \"hello\", ion_value2: \"world\"}"} +``` + +The processor parses the event into the following output: + +``` +{"ion_value1": "hello", "ion_value2" : "world"} +``` + + diff --git a/_data-prepper/pipelines/configuration/processors/split-event.md b/_data-prepper/pipelines/configuration/processors/split-event.md index 2dbdaf0cc0..f059fe5b95 100644 --- a/_data-prepper/pipelines/configuration/processors/split-event.md +++ b/_data-prepper/pipelines/configuration/processors/split-event.md @@ -3,7 +3,7 @@ layout: default title: split-event parent: Processors grand_parent: Pipelines -nav_order: 56 +nav_order: 96 --- # split-event diff --git a/_data-prepper/pipelines/configuration/sinks/opensearch.md b/_data-prepper/pipelines/configuration/sinks/opensearch.md index b4861f68fd..d485fbb2b9 100644 --- a/_data-prepper/pipelines/configuration/sinks/opensearch.md +++ b/_data-prepper/pipelines/configuration/sinks/opensearch.md @@ -50,45 +50,82 @@ pipeline: The following table describes options you can configure for the `opensearch` sink. + +Option | Required | Type | Description +:--- | :--- |:---| :--- +`hosts` | Yes | List | A list of OpenSearch hosts to write to, such as `["https://localhost:9200", "https://remote-cluster:9200"]`. +`cert` | No | String | The path to the security certificate. 
For example, `"config/root-ca.pem"` if the cluster uses the OpenSearch Security plugin. +`username` | No | String | The username for HTTP basic authentication. +`password` | No | String | The password for HTTP basic authentication. +`aws` | No | AWS | The [AWS](#aws) configuration. +[max_retries](#configure-max_retries) | No | Integer | The maximum number of times that the `opensearch` sink should try to push data to the OpenSearch server before considering it to be a failure. Defaults to `Integer.MAX_VALUE`. When not provided, the sink will try to push data to the OpenSearch server indefinitely and exponential backoff will increase the waiting time before a retry. +`aws_sigv4` | No | Boolean | **Deprecated in Data Prepper 2.7.** Default is `false`. Whether to use AWS Identity and Access Management (IAM) signing to connect to an Amazon OpenSearch Service domain. For your access key, secret key, and optional session token, Data Prepper uses the default credential chain (environment variables, Java system properties, `~/.aws/credential`). +`aws_region` | No | String | **Deprecated in Data Prepper 2.7.** The AWS Region (for example, `"us-east-1"`) for the domain when you are connecting to Amazon OpenSearch Service. +`aws_sts_role_arn` | No | String | **Deprecated in Data Prepper 2.7.** The IAM role that the plugin uses to sign requests sent to Amazon OpenSearch Service. If this information is not provided, then the plugin uses the default credentials. +`socket_timeout` | No | Integer | The timeout value, in milliseconds, when waiting for data to be returned (the maximum period of inactivity between two consecutive data packets). A timeout value of `0` is interpreted as an infinite timeout. If this timeout value is negative or not set, then the underlying Apache HttpClient will rely on operating system settings to manage socket timeouts. +`connect_timeout` | No | Integer| The timeout value, in milliseconds, when requesting a connection from the connection manager. A timeout value of `0` is interpreted as an infinite timeout. If this timeout value is negative or not set, the underlying Apache HttpClient will rely on operating system settings to manage connection timeouts. +`insecure` | No | Boolean | Whether or not to verify SSL certificates. If set to `true`, then certificate authority (CA) certificate verification is disabled and insecure HTTP requests are sent instead. Default is `false`. +`proxy` | No | String | The address of the [forward HTTP proxy server](https://en.wikipedia.org/wiki/Proxy_server). The format is `"<hostname or IP>:<port>"` (for example, `"example.com:8100"`, `"http://example.com:8100"`, `"112.112.112.112:8100"`). The port number cannot be omitted. +`index` | Conditionally | String | The name of the export index. Only required when the `index_type` is `custom`. The index can be a plain string, such as `my-index-name`, contain [Java date-time patterns](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html), such as `my-index-${yyyy.MM.dd}` or `my-${yyyy-MM-dd-HH}-index`, be formatted using field values, such as `my-index-${/my_field}`, or use [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `my-index-${getMetadata(\"my_metadata_field\"}`. All formatting options can be combined to provide flexibility when creating static, dynamic, and rolling indexes. +`index_type` | No | String | Tells the sink plugin what type of data it is handling. 
Valid values are `custom`, `trace-analytics-raw`, `trace-analytics-service-map`, or `management-disabled`. Default is `custom`. +`template_type` | No | String | Defines what type of OpenSearch template to use. Available options are `v1` and `index-template`. The default value is `v1`, which uses the original OpenSearch templates available at the `_template` API endpoints. The `index-template` option uses composable [index templates]({{site.url}}{{site.baseurl}}/opensearch/index-templates/), which are available through the OpenSearch `_index_template` API. Composable index types offer more flexibility than the default and are necessary when an OpenSearch cluster contains existing index templates. Composable templates are available for all versions of OpenSearch and some later versions of Elasticsearch. When `distribution_version` is set to `es6`, Data Prepper enforces the `template_type` as `v1`. +`template_file` | No | String | The path to a JSON [index template]({{site.url}}{{site.baseurl}}/opensearch/index-templates/) file, such as `/your/local/template-file.json`, when `index_type` is set to `custom`. For an example template file, see [otel-v1-apm-span-index-template.json](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/src/main/resources/otel-v1-apm-span-index-template.json). If you supply a template file, then it must match the template format specified by the `template_type` parameter. +`template_content` | No | JSON | Contains all the inline JSON found inside of the index [index template]({{site.url}}{{site.baseurl}}/opensearch/index-templates/). For an example of template content, see [the example template content](#example_template_content). +`document_id_field` | No | String | **Deprecated in Data Prepper 2.7 in favor of `document_id`.** The field from the source data to use for the OpenSearch document ID (for example, `"my-field"`) if `index_type` is `custom`. +`document_id` | No | String | A format string to use as the `_id` in OpenSearch documents. To specify a single field in an event, use `${/my_field}`. You can also use Data Prepper expressions to construct the `document_id`, for example, `${getMetadata(\"some_metadata_key\")}`. These options can be combined into more complex formats, such as `${/my_field}-test-${getMetadata(\"some_metadata_key\")}`. +`document_version` | No | String | A format string to use as the `_version` in OpenSearch documents. To specify a single field in an event, use `${/my_field}`. You can also use Data Prepper expressions to construct the `document_version`, for example, `${getMetadata(\"some_metadata_key\")}`. These options can be combined into more complex versions, such as `${/my_field}${getMetadata(\"some_metadata_key\")}`. The `document_version` format must evaluate to a long type and can only be used when `document_version_type` is set to either `external` or `external_gte`. +`document_version_type` | No | String | The document version type for index operations. Must be one of `external`, `external_gte`, or `internal`. If set to `external` or `external_gte`, then `document_version` is required. +`dlq_file` | No | String | The path to your preferred dead letter queue file (such as `/your/local/dlq-file`). Data Prepper writes to this file when it fails to index a document on the OpenSearch cluster. +`dlq` | No | N/A | [DLQ configurations]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/dlq/). 
+`bulk_size` | No | Integer (long) | The maximum size (in MiB) of bulk requests sent to the OpenSearch cluster. Values below `0` indicate an unlimited size. If a single document exceeds the maximum bulk request size, then Data Prepper sends each request individually. Default value is `5`. +`ism_policy_file` | No | String | The absolute file path for an Index State Management (ISM) policy JSON file. This policy file is effective only when there is no built-in policy file for the index type. For example, the `custom` index type is currently the only type without a built-in policy file, so it will use this policy file if it is provided through this parameter. For more information about the policy JSON file, see [ISM policies]({{site.url}}{{site.baseurl}}/im-plugin/ism/policies/). +`number_of_shards` | No | Integer | The number of primary shards that an index should have on the destination OpenSearch server. This parameter is effective only when `template_file` is either explicitly provided in the sink configuration or built in. If this parameter is set, then it will override the value in the index template file. For more information, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/). +`number_of_replicas` | No | Integer | The number of replica shards that each primary shard should have on the destination OpenSearch server. For example, if you have 4 primary shards and set `number_of_replicas` to `3`, then the index has 12 replica shards. This parameter is effective only when `template_file` is either explicitly provided in the sink configuration or built in. If this parameter is set, then it will override the value in the index template file. For more information, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/). +`distribution_version` | No | String | Indicates whether the backend version of the sink is Elasticsearch 6 or later. `es6` represents Elasticsearch 6. `default` represents the latest compatible backend version, such as Elasticsearch 7.x, OpenSearch 1.x, or OpenSearch 2.x. Default is `default`. +`enable_request_compression` | No | Boolean | Whether to enable compression when sending requests to OpenSearch. When `distribution_version` is set to `es6`, default is `false`. For all other distribution versions, default is `true`. +`action` | No | String | The OpenSearch bulk action to use for documents. Must be one of `create`, `index`, `update`, `upsert`, or `delete`. Default is `index`. +`actions` | No | List | A [list of actions](#actions) that can be used as an alternative to `action`, which reads as a switch case statement that conditionally determines the bulk action to take for an event. +`flush_timeout` | No | Long | A long class that contains the amount of time, in milliseconds, to try packing a bulk request up to the `bulk_size` before flushing the request. If this timeout expires before a bulk request has reached the `bulk_size`, the request will be flushed. Set to `-1` to disable the flush timeout and instead flush whatever is present at the end of each batch. Default is `60,000`, or 1 minute. +`normalize_index` | No | Boolean | If true, then the OpenSearch sink will try to create dynamic index names. Index names with format options specified in `${})` are valid according to the [index naming restrictions]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/#index-naming-restrictions). Any invalid characters will be removed. Default value is `false`. 
+`routing` | No | String | A string used as a hash for generating the `shard_id` for a document when it is stored in OpenSearch. Each incoming record is searched. When present, the string is used as the routing field for the document. When not present, the default routing mechanism (`document_id`) is used by OpenSearch when storing the document. Supports formatting with fields in events and [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), such as `${/my_field}-test-${getMetadata(\"some_metadata_key\")}`. +`document_root_key` | No | String | The key in the event that will be used as the root in the document. The default is the root of the event. If the key does not exist, then the entire event is written as the document. If `document_root_key` is of a basic value type, such as a string or integer, then the document will have a structure of `{"data": }`. +`serverless` | No | Boolean | Determines whether the OpenSearch backend is Amazon OpenSearch Serverless. Set this value to `true` when the destination for the `opensearch` sink is an Amazon OpenSearch Serverless collection. Default is `false`. +`serverless_options` | No | Object | The network configuration options available when the backend of the `opensearch` sink is set to Amazon OpenSearch Serverless. For more information, see [Serverless options](#serverless-options). + + + +## aws + Option | Required | Type | Description :--- | :--- | :--- | :--- -hosts | Yes | List | List of OpenSearch hosts to write to (for example, `["https://localhost:9200", "https://remote-cluster:9200"]`). -cert | No | String | Path to the security certificate (for example, `"config/root-ca.pem"`) if the cluster uses the OpenSearch Security plugin. -username | No | String | Username for HTTP basic authentication. -password | No | String | Password for HTTP basic authentication. -aws_sigv4 | No | Boolean | Default value is false. Whether to use AWS Identity and Access Management (IAM) signing to connect to an Amazon OpenSearch Service domain. For your access key, secret key, and optional session token, Data Prepper uses the default credential chain (environment variables, Java system properties, `~/.aws/credential`, etc.). -aws_region | No | String | The AWS region (for example, `"us-east-1"`) for the domain if you are connecting to Amazon OpenSearch Service. -aws_sts_role_arn | No | String | IAM role that the plugin uses to sign requests sent to Amazon OpenSearch Service. If this information is not provided, the plugin uses the default credentials. -[max_retries](#configure-max_retries) | No | Integer | The maximum number of times the OpenSearch sink should try to push data to the OpenSearch server before considering it to be a failure. Defaults to `Integer.MAX_VALUE`. If not provided, the sink will try to push data to the OpenSearch server indefinitely because the default value is high and exponential backoff would increase the waiting time before retry. -socket_timeout | No | Integer | The timeout, in milliseconds, waiting for data to return (or the maximum period of inactivity between two consecutive data packets). A timeout value of zero is interpreted as an infinite timeout. If this timeout value is negative or not set, the underlying Apache HttpClient would rely on operating system settings for managing socket timeouts. -connect_timeout | No | Integer | The timeout in milliseconds used when requesting a connection from the connection manager. A timeout value of zero is interpreted as an infinite timeout. 
If this timeout value is negative or not set, the underlying Apache HttpClient would rely on operating system settings for managing connection timeouts. -insecure | No | Boolean | Whether or not to verify SSL certificates. If set to true, certificate authority (CA) certificate verification is disabled and insecure HTTP requests are sent instead. Default value is `false`. -proxy | No | String | The address of a [forward HTTP proxy server](https://en.wikipedia.org/wiki/Proxy_server). The format is "<host name or IP>:<port>". Examples: "example.com:8100", "http://example.com:8100", "112.112.112.112:8100". Port number cannot be omitted. -index | Conditionally | String | Name of the export index. Applicable and required only when the `index_type` is `custom`. -index_type | No | String | This index type tells the Sink plugin what type of data it is handling. Valid values: `custom`, `trace-analytics-raw`, `trace-analytics-service-map`, `management-disabled`. Default value is `custom`. -template_type | No | String | Defines what type of OpenSearch template to use. The available options are `v1` and `index-template`. The default value is `v1`, which uses the original OpenSearch templates available at the `_template` API endpoints. The `index-template` option uses composable [index templates]({{site.url}}{{site.baseurl}}/opensearch/index-templates/) which are available through OpenSearch's `_index_template` API. Composable index types offer more flexibility than the default and are necessary when an OpenSearch cluster has already existing index templates. Composable templates are available for all versions of OpenSearch and some later versions of Elasticsearch. When `distribution_version` is set to `es6`, Data Prepper enforces the `template_type` as `v1`. -template_file | No | String | The path to a JSON [index template]({{site.url}}{{site.baseurl}}/opensearch/index-templates/) file such as `/your/local/template-file.json` when `index_type` is set to `custom`. For an example template file, see [otel-v1-apm-span-index-template.json](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/src/main/resources/otel-v1-apm-span-index-template.json). If you supply a template file it must match the template format specified by the `template_type` parameter. -document_id_field | No | String | The field from the source data to use for the OpenSearch document ID (for example, `"my-field"`) if `index_type` is `custom`. -dlq_file | No | String | The path to your preferred dead letter queue file (for example, `/your/local/dlq-file`). Data Prepper writes to this file when it fails to index a document on the OpenSearch cluster. -dlq | No | N/A | DLQ configurations. See [Dead Letter Queues]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/dlq/) for details. If the `dlq_file` option is also available, the sink will fail. -bulk_size | No | Integer (long) | The maximum size (in MiB) of bulk requests sent to the OpenSearch cluster. Values below 0 indicate an unlimited size. If a single document exceeds the maximum bulk request size, Data Prepper sends it individually. Default value is 5. -ism_policy_file | No | String | The absolute file path for an ISM (Index State Management) policy JSON file. This policy file is effective only when there is no built-in policy file for the index type. For example, `custom` index type is currently the only one without a built-in policy file, thus it would use the policy file here if it's provided through this parameter. 
For more information, see [ISM policies]({{site.url}}{{site.baseurl}}/im-plugin/ism/policies/). -number_of_shards | No | Integer | The number of primary shards that an index should have on the destination OpenSearch server. This parameter is effective only when `template_file` is either explicitly provided in Sink configuration or built-in. If this parameter is set, it would override the value in index template file. For more information, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/). -number_of_replicas | No | Integer | The number of replica shards each primary shard should have on the destination OpenSearch server. For example, if you have 4 primary shards and set number_of_replicas to 3, the index has 12 replica shards. This parameter is effective only when `template_file` is either explicitly provided in Sink configuration or built-in. If this parameter is set, it would override the value in index template file. For more information, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/). -distribution_version | No | String | Indicates whether the sink backend version is Elasticsearch 6 or later. `es6` represents Elasticsearch 6. `default` represents the latest compatible backend version, such as Elasticsearch 7.x, OpenSearch 1.x, or OpenSearch 2.x. Default is `default`. -enable_request_compression | No | Boolean | Whether to enable compression when sending requests to OpenSearch. When `distribution_version` is set to `es6`, default is `false`. For all other distribution versions, default is `true`. -serverless | No | Boolean | Determines whether the OpenSearch backend is Amazon OpenSearch Serverless. Set this value to `true` when the destination for the `opensearch` sink is an Amazon OpenSearch Serverless collection. Default is `false`. -serverless_options | No | Object | The network configuration options available when the backend of the `opensearch` sink is set to Amazon OpenSearch Serverless. For more information, see [Serverless options](#serverless-options). - -### Serverless options +`region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). +`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). +`sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. +`sts_external_id` | No | String | The external ID to attach to AssumeRole requests from AWS STS. +`serverless` | No | Boolean | **Deprecated in Data Prepper 2.7. Use this option with the `aws` configuration instead.** Determines whether the OpenSearch backend is Amazon OpenSearch Serverless. Set this value to `true` when the destination for the `opensearch` sink is an Amazon OpenSearch Serverless collection. Default is `false`. +`serverless_options` | No | Object | **Deprecated in Data Prepper 2.7. Use this option with the `aws` configuration instead.** The network configuration options available when the backend of the `opensearch` sink is set to Amazon OpenSearch Serverless. For more information, see [Serverless options](#serverless-options). + + +## actions + + +The following options can be used inside the `actions` option. 
+ +Option | Required | Type | Description +:--- |:---| :--- | :--- +`type` | Yes | String | The type of bulk action to use if the `when` condition evaluates to true. Must be either `create`, `index`, `update`, `upsert`, or `delete`. +`when` | No | String | A [Data Prepper expression]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/) that conditionally evaluates whether an event will be sent to OpenSearch using the bulk action configured in `type`. When empty, the bulk action will be chosen automatically when the event is sent to OpenSearch. + + +## Serverless options The following options can be used in the `serverless_options` object. Option | Required | Type | Description :--- | :--- | :---| :--- -network_policy_name | Yes | String | The name of the network policy to create. -collection_name | Yes | String | The name of the Amazon OpenSearch Serverless collection to configure. -vpce_id | Yes | String | The virtual private cloud (VPC) endpoint to which the source connects. +`network_policy_name` | Yes | String | The name of the network policy to create. +`collection_name` | Yes | String | The name of the Amazon OpenSearch Serverless collection to configure. +`vpce_id` | Yes | String | The virtual private cloud (VPC) endpoint to which the source connects. ### Configure max_retries @@ -191,7 +228,6 @@ If your domain uses a master user in the internal user database, specify the mas sink: opensearch: hosts: ["https://your-fgac-amazon-opensearch-service-endpoint"] - aws_sigv4: false username: "master-username" password: "master-password" ``` @@ -302,3 +338,53 @@ log-pipeline: sts_role_arn: "arn:aws:iam:::role/PipelineRole" region: "us-east-1" ``` + +### Example with template_content and actions + +The following example pipeline contains both `template_content` and a list of conditional `actions`: + +```yaml +log-pipeline: + source: + http: + processor: + - date: + from_time_received: true + destination: "@timestamp" + sink: + - opensearch: + hosts: [ "https://" ] + index: "my-serverless-index" + template_type: index-template + template_content: > + { + "template" : { + "mappings" : { + "properties" : { + "Data" : { + "type" : "binary" + }, + "EncodedColors" : { + "type" : "binary" + }, + "Type" : { + "type" : "keyword" + }, + "LargeDouble" : { + "type" : "double" + } + } + } + } + } + # index is the default case + actions: + - type: "delete" + when: '/operation == "delete"' + - type: "update" + when: '/operation == "update"' + - type: "index" + aws: + sts_role_arn: "arn:aws:iam:::role/PipelineRole" + region: "us-east-1" +``` diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index cb881e814a..c752bf6b3d 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -8,7 +8,22 @@ nav_order: 55 # s3 -The `s3` sink saves batches of events to [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) objects. +The `s3` sink saves and writes batches of Data Prepper events to Amazon Simple Storage Service (Amazon S3) objects. The configured `codec` determines how the `s3` sink serializes the data into Amazon S3. 
+ +The `s3` sink uses the following format when batching events: + +``` +${pathPrefix}events-%{yyyy-MM-dd'T'HH-mm-ss'Z'}-${currentTimeInNanos}-${uniquenessId}.${codecSuppliedExtension} +``` + +When a batch of objects is written to S3, the objects are formatted similarly to the following: + +``` +my-logs/2023/06/09/06/events-2023-06-09T06-00-01-1686290401871214927-ae15b8fa-512a-59c2-b917-295a0eff97c8.json +``` + + +For more information about how to configure an object, see the [Object key](#object-key-configuration) section. ## Usage @@ -22,14 +37,12 @@ pipeline: aws: region: us-east-1 sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper - sts_header_overrides: max_retries: 5 - bucket: - name: bucket_name - object_key: - path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/ + bucket: mys3bucket + object_key: + path_prefix: my-logs/%{yyyy}/%{MM}/%{dd}/ threshold: - event_count: 2000 + event_count: 10000 maximum_size: 50mb event_collect_timeout: 15s codec: @@ -37,17 +50,37 @@ pipeline: buffer_type: in_memory ``` +## IAM permissions + +In order to use the `s3` sink, configure AWS Identity and Access Management (IAM) to grant Data Prepper permissions to write to Amazon S3. You can use a configuration similar to the following JSON configuration: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "s3-access", + "Effect": "Allow", + "Action": [ + "s3:PutObject" + ], + "Resource": "arn:aws:s3:::/*" + } + ] +} +``` + ## Configuration Use the following options when customizing the `s3` sink. Option | Required | Type | Description :--- | :--- | :--- | :--- -`bucket` | Yes | String | The object from which the data is retrieved and then stored. The `name` must match the name of your object store. -`codec` | Yes | [Buffer type](#buffer-type) | Determines the buffer type. +`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. +`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object. `aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information. `threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3. -`object_key` | No | Sets the `path_prefix` and the `file_pattern` of the object store. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found inside the root directory of the bucket. +`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` of the object in S3. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found in the root directory of the bucket. `compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`. `buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type. `max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`. @@ -59,33 +92,34 @@ Option | Required | Type | Description `region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). `sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). 
`sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. -`sts_external_id` | No | String | The external ID to attach to AssumeRole requests from AWS STS. +`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the role. For more information, see the `ExternalId` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference. + ## Threshold configuration -Use the following options to set ingestion thresholds for the `s3` sink. +Use the following options to set ingestion thresholds for the `s3` sink. When any of these conditions are met, Data Prepper will write events to an S3 object. Option | Required | Type | Description :--- | :--- | :--- | :--- -`event_count` | Yes | Integer | The maximum number of events the S3 bucket can ingest. -`maximum_size` | Yes | String | The maximum number of bytes that the S3 bucket can ingest after compression. Defaults to `50mb`. -`event_collect_timeout` | Yes | String | Sets the time period during which events are collected before ingestion. All values are strings that represent duration, either an ISO_8601 notation string, such as `PT20.345S`, or a simple notation, such as `60s` or `1500ms`. +`event_count` | Yes | Integer | The number of Data Prepper events to accumulate before writing an object to S3. +`maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`. +`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`. ## Buffer type -`buffer_type` is an optional configuration that records stored events temporarily before flushing them into an S3 bucket. The default value is `in_memory`. Use one of the following options: +`buffer_type` is an optional configuration that determines how Data Prepper temporarily stores data before writing an object to S3. The default value is `in_memory`. Use one of the following options: - `in_memory`: Stores the record in memory. -- `local_file`: Flushes the record into a file on your machine. +- `local_file`: Flushes the record into a file on your local machine. This uses your machine's temporary directory. - `multipart`: Writes using the [S3 multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html). Every 10 MB is written as a part. ## Object key configuration Option | Required | Type | Description :--- | :--- | :--- | :--- -`path_prefix` | Yes | String | The S3 key prefix path to use. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. By default, events write to the root of the bucket. +`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket. ## codec @@ -156,3 +190,49 @@ Option | Required | Type | Description `schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to true. 
`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event. +### Setting a schema with Parquet + +The following example shows you how to configure the `s3` sink to write Parquet data into a Parquet file using a schema for [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records): + +``` +pipeline: + ... + sink: + - s3: + aws: + region: us-east-1 + sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper + bucket: mys3bucket + object_key: + path_prefix: vpc-flow-logs/%{yyyy}/%{MM}/%{dd}/%{HH}/ + codec: + parquet: + schema: > + { + "type" : "record", + "namespace" : "org.opensearch.dataprepper.examples", + "name" : "VpcFlowLog", + "fields" : [ + { "name" : "version", "type" : ["null", "string"]}, + { "name" : "srcport", "type": ["null", "int"]}, + { "name" : "dstport", "type": ["null", "int"]}, + { "name" : "accountId", "type" : ["null", "string"]}, + { "name" : "interfaceId", "type" : ["null", "string"]}, + { "name" : "srcaddr", "type" : ["null", "string"]}, + { "name" : "dstaddr", "type" : ["null", "string"]}, + { "name" : "start", "type": ["null", "int"]}, + { "name" : "end", "type": ["null", "int"]}, + { "name" : "protocol", "type": ["null", "int"]}, + { "name" : "packets", "type": ["null", "int"]}, + { "name" : "bytes", "type": ["null", "int"]}, + { "name" : "action", "type": ["null", "string"]}, + { "name" : "logStatus", "type" : ["null", "string"]} + ] + } + threshold: + event_count: 500000000 + maximum_size: 20mb + event_collect_timeout: PT15M + buffer_type: in_memory +``` + diff --git a/_data-prepper/pipelines/configuration/sources/dynamo-db.md b/_data-prepper/pipelines/configuration/sources/dynamo-db.md index 597e835151..f75489f103 100644 --- a/_data-prepper/pipelines/configuration/sources/dynamo-db.md +++ b/_data-prepper/pipelines/configuration/sources/dynamo-db.md @@ -31,6 +31,7 @@ cdc-pipeline: s3_prefix: "myprefix" stream: start_position: "LATEST" # Read latest data from streams (Default) + view_on_remove: NEW_IMAGE aws: region: "us-west-2" sts_role_arn: "arn:aws:iam::123456789012:role/my-iam-role" @@ -84,12 +85,112 @@ Option | Required | Type | Description The following option lets you customize how the pipeline reads events from the DynamoDB table. -Option | Required | Type | Description +Option | Required | Type | Description :--- | :--- | :--- | :--- `start_position` | No | String | The position from where the source starts reading stream events when the DynamoDB stream option is enabled. `LATEST` starts reading events from the most recent stream record. +`view_on_remove` | No | Enum | The [stream record view](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) to use for REMOVE events from DynamoDB streams. Must be either `NEW_IMAGE` or `OLD_IMAGE` . Defaults to `NEW_IMAGE`. If the `OLD_IMAGE` option is used and the old image can not be found, the source will find the `NEW_IMAGE`. + +## Exposed metadata attributes + +The following metadata will be added to each event that is processed by the `dynamodb` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/#getmetadata). + +* `primary_key`: The primary key of the DynamoDB item. For tables that only contain a partition key, this value provides the partition key. 
For tables that contain both a partition and sort key, the `primary_key` attribute will be equal to the partition and sort key, separated by a `|`, for example, `partition_key|sort_key`. +* `partition_key`: The partition key of the DynamoDB item. +* `sort_key`: The sort key of the DynamoDB item. This will be null if the table does not contain a sort key. +* `dynamodb_timestamp`: The timestamp of the DynamoDB item. This will be the export time for export items and the DynamoDB stream event time for stream items. This timestamp is used by sinks to emit an `EndtoEndLatency` metric for DynamoDB stream events that tracks the latency between a change occurring in the DynamoDB table and that change being applied to the sink. +* `document_version`: Uses the `dynamodb_timestamp` to modify break ties between stream items that are received in the same second. Recommend for use with the `opensearch` sink's `document_version` setting. +* `opensearch_action`: A default value for mapping DynamoDB event actions to OpenSearch actions. This action will be `index` for export items, and `INSERT` or `MODIFY` for stream events, and `REMOVE` stream events when the OpenSearch action is `delete`. +* `dynamodb_event_name`: The exact event type for the item. Will be `null` for export items and either `INSERT`, `MODIFY`, or `REMOVE` for stream events. +* `table_name`: The name of the DynamoDB table that an event came from. + + +## Permissions + +The following are the minimum required permissions for running DynamoDB as a source: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "allowDescribeTable", + "Effect": "Allow", + "Action": [ + "dynamodb:DescribeTable" + ], + "Resource": [ + "arn:aws:dynamodb:us-east-1:{account-id}:table/my-table" + ] + }, + { + "Sid": "allowRunExportJob", + "Effect": "Allow", + "Action": [ + "dynamodb:DescribeContinuousBackups", + "dynamodb:ExportTableToPointInTime" + ], + "Resource": [ + "arn:aws:dynamodb:us-east-1:{account-id}:table/my-table" + ] + }, + { + "Sid": "allowCheckExportjob", + "Effect": "Allow", + "Action": [ + "dynamodb:DescribeExport" + ], + "Resource": [ + "arn:aws:dynamodb:us-east-1:{account-id}:table/my-table/export/*" + ] + }, + { + "Sid": "allowReadFromStream", + "Effect": "Allow", + "Action": [ + "dynamodb:DescribeStream", + "dynamodb:GetRecords", + "dynamodb:GetShardIterator" + ], + "Resource": [ + "arn:aws:dynamodb:us-east-1:{account-id}:table/my-table/stream/*" + ] + }, + { + "Sid": "allowReadAndWriteToS3ForExport", + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:AbortMultipartUpload", + "s3:PutObject", + "s3:PutObjectAcl" + ], + "Resource": [ + "arn:aws:s3:::my-bucket/*" + ] + } + ] +} +``` + +When performing an export, the `"Sid": "allowReadFromStream"` section is not required. If only reading from DynamoDB streams, the +`"Sid": "allowReadAndWriteToS3ForExport"`, `"Sid": "allowCheckExportjob"`, and ` "Sid": "allowRunExportJob"` sections are not required. + +## Metrics +The `dynamodb` source includes the following metrics. +### Counters +* `exportJobSuccess`: The number of export jobs that have been submitted successfully. +* `exportJobFailure`: The number of export job submission attempts that have failed. +* `exportS3ObjectsTotal`: The total number of export data files found in S3. +* `exportS3ObjectsProcessed`: The total number of export data files that have been processed successfully from S3. +* `exportRecordsTotal`: The total number of records found in the export. 
+* `exportRecordsProcessed`: The total number of export records that have been processed successfully. +* `exportRecordsProcessingErrors`: The number of export record processing errors. +* `changeEventsProcessed`: The number of change events processed from DynamoDB streams. +* `changeEventsProcessingErrors`: The number of processing errors for change events from DynamoDB streams. +* `shardProgress`: The incremented shard progress when DynamoDB streams are being read correctly. This being`0` for any significant amount of time means there is a problem with the pipeline that has streams enabled. diff --git a/_data-prepper/pipelines/configuration/sources/s3.md b/_data-prepper/pipelines/configuration/sources/s3.md index ad5de6884d..7a3746bab6 100644 --- a/_data-prepper/pipelines/configuration/sources/s3.md +++ b/_data-prepper/pipelines/configuration/sources/s3.md @@ -8,7 +8,10 @@ nav_order: 20 # s3 source -`s3` is a source plugin that reads events from [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) objects. It requires an [Amazon Simple Queue Service (Amazon SQS)](https://aws.amazon.com/sqs/) queue that receives [S3 Event Notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html). After Amazon SQS is configured, the `s3` source receives messages from Amazon SQS. When the SQS message indicates that an S3 object was created, the `s3` source loads the S3 objects and then parses them using the configured [codec](#codec). You can also configure the `s3` source to use [Amazon S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) instead of Data Prepper to parse S3 objects. +`s3` is a source plugin that reads events from [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) objects. You can configure the source to either use an [Amazon Simple Queue Service (Amazon SQS)](https://aws.amazon.com/sqs/) queue or scan an S3 bucket: + +- To use Amazon SQS notifications, configure S3 event notifications on your S3 bucket. After Amazon SQS is configured, the `s3` source receives messages from Amazon SQS. When the SQS message indicates that an S3 object has been created, the `s3` source loads the S3 objects and then parses them using the configured [codec](#codec). +- To use an S3 bucket, configure the `s3` source to use Amazon S3 Select instead of Data Prepper to parse S3 objects. ## IAM permissions @@ -86,20 +89,23 @@ Option | Required | Type | Description :--- | :--- | :--- | :--- `notification_type` | Yes | String | Must be `sqs`. `notification_source` | No | String | Determines how notifications are received by SQS. Must be `s3` or `eventbridge`. `s3` represents notifications that are directly sent from Amazon S3 to Amazon SQS or fanout notifications from Amazon S3 to Amazon Simple Notification Service (Amazon SNS) to Amazon SQS. `eventbridge` represents notifications from [Amazon EventBridge](https://aws.amazon.com/eventbridge/) and [Amazon Security Lake](https://aws.amazon.com/security-lake/). Default is `s3`. -`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `automatic`. Default is `none`. +`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, `snappy`, or `automatic`. Default is `none`. `codec` | Yes | Codec | The [codec](#codec) to apply. `sqs` | Yes | SQS | The SQS configuration. See [sqs](#sqs) for more information. `aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information. 
`on_error` | No | String | Determines how to handle errors in Amazon SQS. Can be either `retain_messages` or `delete_messages`. `retain_messages` leaves the message in the Amazon SQS queue and tries to send the message again. This is recommended for dead-letter queues. `delete_messages` deletes failed messages. Default is `retain_messages`. -buffer_timeout | No | Duration | The amount of time allowed for writing events to the Data Prepper buffer before timeout occurs. Any events that the Amazon S3 source cannot write to the buffer during the set amount of time are discarded. Default is `10s`. +`buffer_timeout` | No | Duration | The amount of time allowed for writing events to the Data Prepper buffer before timeout occurs. Any events that the Amazon S3 source cannot write to the buffer during the specified amount of time are discarded. Default is `10s`. `records_to_accumulate` | No | Integer | The number of messages that accumulate before being written to the buffer. Default is `100`. `metadata_root_key` | No | String | The base key for adding S3 metadata to each event. The metadata includes the key and bucket for each S3 object. Default is `s3/`. +`default_bucket_owner` | No | String | The AWS account ID for the owner of an S3 bucket. For more information, see [Cross-account S3 access](#s3_bucket_ownership). +`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#s3_bucket_ownership). `disable_bucket_ownership_validation` | No | Boolean | When `true`, the S3 source does not attempt to validate that the bucket is owned by the expected account. The expected account is the same account that owns the Amazon SQS queue. Default is `false`. `acknowledgments` | No | Boolean | When `true`, enables `s3` sources to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#end-to-end-acknowledgments) when events are received by OpenSearch sinks. `s3_select` | No | [s3_select](#s3_select) | The Amazon S3 Select configuration. `scan` | No | [scan](#scan) | The S3 scan configuration. `delete_s3_objects_on_read` | No | Boolean | When `true`, the S3 scan attempts to delete S3 objects after all events from the S3 object are successfully acknowledged by all sinks. `acknowledgments` should be enabled when deleting S3 objects. Default is `false`. -`workers` | No | Integer | The number of worker threads. Default is `1`, with a min of `1` and a max of `1000`. Each worker thread subscribes to Amazon SQS messages. When a worker receives an SQS message, that worker processes the message independently from the other workers. +`workers` | No | Integer | Configures the number of worker threads that the source uses to read data from S3. Leaving this value at the default unless your S3 objects are less than 1MB. Performance may decrease for larger S3 objects. This setting only affects SQS-based sources. Default is `1`. + ## sqs @@ -113,7 +119,7 @@ Option | Required | Type | Description `visibility_timeout` | No | Duration | The visibility timeout to apply to messages read from the Amazon SQS queue. This should be set to the amount of time that Data Prepper may take to read all the S3 objects in a batch. Default is `30s`. `wait_time` | No | Duration | The amount of time to wait for long polling on the Amazon SQS API. Default is `20s`. 
`poll_delay` | No | Duration | A delay placed between the reading and processing of a batch of Amazon SQS messages and making a subsequent request. Default is `0s`. -`visibility_duplication_protection` | No | Boolean | If set to `true`, Data Prepper attempts to avoid duplicate processing by extending the visibility timeout of SQS messages. Until the data reaches the sink, Data Prepper will regularly call `ChangeMessageVisibility` to avoid reading the S3 object again. To use this feature, you need to grant permissions to `ChangeMessageVisibility` on the IAM role. Default is `false`. +`visibility_duplication_protection` | No | Boolean | If set to `true`, Data Prepper attempts to avoid duplicate processing by extending the visibility timeout of SQS messages. Until the data reaches the sink, Data Prepper will regularly call `ChangeMessageVisibility` to avoid rereading of the S3 object. To use this feature, you need to grant permissions to `sqs:ChangeMessageVisibility` on the IAM role. Default is `false`. `visibility_duplicate_protection_timeout` | No | Duration | Sets the maximum total length of time that a message will not be processed when using `visibility_duplication_protection`. Defaults to two hours. @@ -124,6 +130,7 @@ Option | Required | Type | Description `region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html). `sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html). `aws_sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. +`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the STS role. For more information, see the `ExternalID` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference. ## codec @@ -155,9 +162,6 @@ Option | Required | Type | Description `header` | No | String list | The header containing the column names used to parse CSV data. `detect_header` | No | Boolean | Whether the first line of the Amazon S3 object should be interpreted as a header. Default is `true`. - - - ## Using `s3_select` with the `s3` source When configuring `s3_select` to parse Amazon S3 objects, use the following options: @@ -199,16 +203,18 @@ Option | Required | Type | Description `start_time` | No | String | The time from which to start scanning objects modified after the given `start_time`. This should follow [ISO LocalDateTime](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE_TIME) format, for example, `023-01-23T10:00:00`. If `end_time` is configured along with `start_time`, all objects after `start_time` and before `end_time` will be processed. `start_time` and `range` cannot be used together. `end_time` | No | String | The time after which no objects will be scanned after the given `end_time`. This should follow [ISO LocalDateTime](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE_TIME) format, for example, `023-01-23T10:00:00`. If `start_time` is configured along with `end_time`, all objects after `start_time` and before `end_time` will be processed. 
`end_time` and `range` cannot be used together. `range` | No | String | The time range from which objects are scanned from all buckets. Supports ISO_8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`). `start_time` and `end_time` cannot be used with `range`. Range `P12H` scans all the objects modified in the last 12 hours from the time pipeline started. -`buckets` | Yes | List | A list of [buckets](#bucket) to scan. +`buckets` | Yes | List | A list of [scan buckets](#scan-bucket) to scan. `scheduling` | No | List | The configuration for scheduling periodic scans on all buckets. `start_time`, `end_time` and `range` can not be used if scheduling is configured. -### bucket + +### scan bucket + Option | Required | Type | Description :--- | :--- |:-----| :--- `bucket` | Yes | Map | Provides options for each bucket. -You can configure the following options inside the [bucket](#bucket) setting. +You can configure the following options in the `bucket` setting map. Option | Required | Type | Description :--- | :--- | :--- | :--- @@ -245,13 +251,17 @@ The `s3` source includes the following metrics: * `s3ObjectsNotFound`: The number of S3 objects that the `s3` source failed to read due to an S3 "Not Found" error. These are also counted toward `s3ObjectsFailed`. * `s3ObjectsAccessDenied`: The number of S3 objects that the `s3` source failed to read due to an "Access Denied" or "Forbidden" error. These are also counted toward `s3ObjectsFailed`. * `s3ObjectsSucceeded`: The number of S3 objects that the `s3` source successfully read. +* `s3ObjectNoRecordsFound`: The number of S3 objects that resulted in 0 records being added to the buffer by the `s3` source. +* `s3ObjectsDeleted`: The number of S3 objects deleted by the `s3` source. +* `s3ObjectsDeleteFailed`: The number of S3 objects that the `s3` source failed to delete. +* `s3ObjectsEmpty`: The number of S3 objects that are considered empty because they have a size of `0`. These objects will be skipped by the `s3` source. * `sqsMessagesReceived`: The number of Amazon SQS messages received from the queue by the `s3` source. * `sqsMessagesDeleted`: The number of Amazon SQS messages deleted from the queue by the `s3` source. * `sqsMessagesFailed`: The number of Amazon SQS messages that the `s3` source failed to parse. -* `s3ObjectNoRecordsFound` -- The number of S3 objects that resulted in 0 records added to the buffer by the `s3` source. * `sqsMessagesDeleteFailed` -- The number of SQS messages that the `s3` source failed to delete from the SQS queue. -* `s3ObjectsDeleted` -- The number of S3 objects deleted by the `s3` source. -* `s3ObjectsDeleteFailed` -- The number of S3 objects that the `s3` source failed to delete. +* `sqsVisibilityTimeoutChangedCount`: The number of times that the `s3` source changed the visibility timeout for an SQS message. This includes multiple visibility timeout changes on the same message. +* `sqsVisibilityTimeoutChangeFailedCount`: The number of times that the `s3` source failed to change the visibility timeout for an SQS message. This includes multiple visibility timeout change failures on the same message. +* `acknowledgementSetCallbackCounter`: The number of times that the `s3` source received an acknowledgment from Data Prepper. 
 ### Timers
 
diff --git a/_data-prepper/pipelines/expression-syntax.md b/_data-prepper/pipelines/expression-syntax.md
index 8257ab8978..be0be6f792 100644
--- a/_data-prepper/pipelines/expression-syntax.md
+++ b/_data-prepper/pipelines/expression-syntax.md
@@ -230,7 +230,7 @@ The `length()` function takes one argument of the JSON pointer type and returns
 
 ### `hasTags()`
 
-The `hastags()` function takes one or more string type arguments and returns `true` if all the arguments passed are present in an event's tags. When an argument does not exist in the event's tags, the function returns `false`. For example, if you use the expression `hasTags("tag1")` and the event contains `tag1`, Data Prepper returns `true`. If you use the expression `hasTags("tag2")` but the event only contains a `tag1` tag, Data Prepper returns `false`.
+The `hasTags()` function takes one or more string type arguments and returns `true` if all of the arguments passed are present in an event's tags. When an argument does not exist in the event's tags, the function returns `false`. For example, if you use the expression `hasTags("tag1")` and the event contains `tag1`, Data Prepper returns `true`. If you use the expression `hasTags("tag2")` but the event only contains `tag1`, Data Prepper returns `false`.
 
 ### `getMetadata()`
 
@@ -245,3 +245,21 @@ The `contains()` function takes two string arguments and determines whether eit
 
 The `cidrContains()` function takes two or more arguments. The first argument is a JSON pointer, which represents the key to the IP address that is checked. It supports both IPv4 and IPv6 addresses. Every argument that comes after the key is a string type that represents CIDR blocks that are checked against. If the IP address in the first argument is in the range of any of the given CIDR blocks, the function returns `true`. If the IP address is not in the range of the CIDR blocks, the function returns `false`.
 
 For example, `cidrContains(/sourceIp,"192.0.2.0/24","10.0.1.0/16")` will return `true` if the `sourceIp` field indicated in the JSON pointer has a value of `192.0.2.5`.
+
+### `join()`
+
+The `join()` function joins the elements of a list into a string. The function takes a JSON pointer, which represents the key to either a list or a map in which the values are lists, and joins the list elements into strings using a comma (`,`), the default delimiter.
+
+For example, if the input event is `{"source": [1, 2, 3]}`, then `join(/source)` returns `"1,2,3"`. If the JSON pointer refers to a map in which the values are lists, as in the following event:
+
+```json
+{"source": {"key1": [1, 2, 3], "key2": ["a", "b", "c"]}}
+```
+
+then `join(/source)` joins each list and returns the following result:
+
+```json
+{"key1": "1,2,3", "key2": "a,b,c"}
+```
+
+You can also specify a delimiter other than the default inside the expression. For example, `join("-", /source)` joins each `source` field using a hyphen (`-`) as the delimiter.
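
Functions such as `hasTags()` and `cidrContains()` are typically used in pipeline options that accept a conditional expression. The following is a minimal sketch, assuming a hypothetical pipeline name and event field, that combines two of these functions in the `drop_events` processor's `drop_when` condition:

```yaml
drop-noise-pipeline:
  source:
    http:
  processor:
    - drop_events:
        drop_when: 'hasTags("debug") or cidrContains(/clientIp, "10.0.0.0/8")'
  sink:
    - stdout:
```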
From 04e9b1b07693c9cec859c9eb322436458eda7686 Mon Sep 17 00:00:00 2001 From: Heather Halter Date: Thu, 28 Mar 2024 08:42:31 -0700 Subject: [PATCH 11/18] Update CODEOWNERS (#6799) Signed-off-by: Heather Halter --- .github/CODEOWNERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index b5fb80d3aa..0ec6c5e009 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1 +1 @@ -* @hdhalter @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @scrawfor99 +* @hdhalter @kolchfa-aws @Naarcha-AWS @vagimeli @AMoo-Miki @natebower @dlvenable @scrawfor99 @epugh From 1711edd0324058217852011a13c517f6947c942e Mon Sep 17 00:00:00 2001 From: Heather Halter Date: Thu, 28 Mar 2024 08:42:54 -0700 Subject: [PATCH 12/18] Added Eric as maintainer (#6798) Signed-off-by: Heather Halter --- MAINTAINERS.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS.md b/MAINTAINERS.md index 921e46ab09..1bf2a1d219 100644 --- a/MAINTAINERS.md +++ b/MAINTAINERS.md @@ -1,6 +1,6 @@ ## Overview -This document contains a list of maintainers in this repo. See [opensearch-project/.github/RESPONSIBILITIES.md](https://github.com/opensearch-project/.github/blob/main/RESPONSIBILITIES.md#maintainer-responsibilities) that explains what the role of maintainer means, what maintainers do in this and other repos, and how they should be doing it. If you're interested in contributing, and becoming a maintainer, see [CONTRIBUTING](CONTRIBUTING.md). +This document lists the maintainers in this repo. See [opensearch-project/.github/RESPONSIBILITIES.md](https://github.com/opensearch-project/.github/blob/main/RESPONSIBILITIES.md#maintainer-responsibilities) for information about the role of a maintainer, what maintainers do in this and other repos, and how they should be doing it. If you're interested in contributing or becoming a maintainer, see [CONTRIBUTING](CONTRIBUTING.md). ## Current Maintainers @@ -9,8 +9,9 @@ This document contains a list of maintainers in this repo. 
See [opensearch-proje | Heather Halter | [hdhalter](https://github.com/hdhalter) | Amazon | | Fanit Kolchina | [kolchfa-aws](https://github.com/kolchfa-aws) | Amazon | | Nate Archer | [Naarcha-AWS](https://github.com/Naarcha-AWS) | Amazon | -| Nate Bower | [natebower](https://github.com/natebower) | Amazon | +| Nathan Bower | [natebower](https://github.com/natebower) | Amazon | | Melissa Vagi | [vagimeli](https://github.com/vagimeli) | Amazon | | Miki Barahmand | [AMoo-Miki](https://github.com/AMoo-Miki) | Amazon | | David Venable | [dlvenable](https://github.com/dlvenable) | Amazon | | Stephen Crawford | [scraw99](https://github.com/scrawfor99) | Amazon | +| Eric Pugh | [epugh](https://github.com/epugh) | OpenSource Connections | From 13a35378cd83176290f8a032d95b8f87f6c66cfb Mon Sep 17 00:00:00 2001 From: landon-l8 <137821564+landon-l8@users.noreply.github.com> Date: Thu, 28 Mar 2024 09:50:57 -0600 Subject: [PATCH 13/18] Update reindex.md (#6760) Added dest > pipeline to the documentation Signed-off-by: landon-l8 <137821564+landon-l8@users.noreply.github.com> --- _api-reference/document-apis/reindex.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_api-reference/document-apis/reindex.md b/_api-reference/document-apis/reindex.md index 766f5b2872..4a0346ede3 100644 --- a/_api-reference/document-apis/reindex.md +++ b/_api-reference/document-apis/reindex.md @@ -73,10 +73,11 @@ slice | Whether to manually or automatically slice the reindex operation so it e _source | Whether to reindex source fields. Specify a list of fields to reindex or true to reindex all fields. Default is true. id | The ID to associate with manual slicing. max | Maximum number of slices. -dest | Information about the destination index. Valid values are `index`, `version_type`, and `op_type`. +dest | Information about the destination index. Valid values are `index`, `version_type`, `op_type`, and `pipeline`. index | Name of the destination index. version_type | The indexing operation's version type. Valid values are `internal`, `external`, `external_gt` (retrieve the document if the specified version number is greater than the document’s current version), and `external_gte` (retrieve the document if the specified version number is greater or equal to than the document’s current version). op_type | Whether to copy over documents that are missing in the destination index. Valid values are `create` (ignore documents with the same ID from the source index) and `index` (copy everything from the source index). +pipeline | Which ingest pipeline to utilize during the reindex. script | A script that OpenSearch uses to apply transformations to the data during the reindex operation. source | The actual script that OpenSearch runs. lang | The scripting language. Valid options are `painless`, `expression`, `mustache`, and `java`. 
From 6f8261ba165b8ff59780addf3e27ff1c7e6a6997 Mon Sep 17 00:00:00 2001 From: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> Date: Thu, 28 Mar 2024 15:59:13 +0000 Subject: [PATCH 14/18] Adding explanation for editing permissions 20230825 (#6606) * adding explination for editing permissions Signed-off-by: leanne.laceybyrne@eliatra.com * changed to a h3 to see if review dog will accept Signed-off-by: leanne.laceybyrne@eliatra.com * Update _security/access-control/document-level-security.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> * Update _security/access-control/document-level-security.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> * Update _security/access-control/document-level-security.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> * Update _security/access-control/document-level-security.md Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> * Updates to both the users roles and DLS sections to reflect need to add edit DLS section Signed-off-by: leanne.laceybyrne@eliatra.com * updating after reviewdog comments Signed-off-by: leanne.laceybyrne@eliatra.com * updating roles in OpenSearch updates Signed-off-by: leanne.laceybyrne@eliatra.com * Apply suggestions from code review Updates following review Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update document-level-security.md Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> * Update _security/access-control/document-level-security.md Co-authored-by: Nathan Bower Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> --------- Signed-off-by: leanne.laceybyrne@eliatra.com Signed-off-by: leanneeliatra <131779422+leanneeliatra@users.noreply.github.com> Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../access-control/document-level-security.md | 49 ++++++++++--------- _security/access-control/users-roles.md | 35 +++++++++++++ 2 files changed, 60 insertions(+), 24 deletions(-) diff --git a/_security/access-control/document-level-security.md b/_security/access-control/document-level-security.md index 3f2049a1e2..be5fe7e0da 100644 --- a/_security/access-control/document-level-security.md +++ b/_security/access-control/document-level-security.md @@ -10,30 +10,31 @@ redirect_from: # Document-level security (DLS) -Document-level security lets you restrict a role to a subset of documents in an index. The easiest way to get started with document- and field-level security is to open OpenSearch Dashboards and choose **Security**. Then choose **Roles**, create a new role, and review the **Index permissions** section. 
- -![Document- and field-level security screen in OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/images/security-dls.png) - - -## Simple roles - -Document-level security uses the OpenSearch query DSL to define which documents a role grants access to. In OpenSearch Dashboards, choose an index pattern and provide a query in the **Document level security** section: - -```json -{ - "bool": { - "must": { - "match": { - "genres": "Comedy" - } - } - } -} -``` - -This query specifies that for the role to have access to a document, its `genres` field must include `Comedy`. - -A typical request to the `_search` API includes `{ "query": { ... } }` around the query, but in this case, you only need to specify the query itself. +Document-level security lets you restrict a role to a subset of documents in an index. +For more information about OpenSearch users and roles, see the [documentation](https://opensearch.org/docs/latest/security/access-control/users-roles/#create-roles). + +Use the following steps to get started with document-level and field-level security: +1. Open OpenSearch Dashboards. +2. Choose **Security** > **Roles**. +3. Select **Create Role** and provide a name for the role. +4. Review the **Index permissions** section and any necessary [index permissions](https://opensearch.org/docs/latest/security/access-control/permissions/) for the role. +5. Add document-level security, with the addition of a domain-specific language (DSL) query in the `Document level security - optional` section. A typical request sent to the `_search` API includes `{ "query": { ... } }` around the query, but with document-level security in OpenSearch Dashboards, you only need to specify the query itself. For example, the following DSL query specifies that for the new role to have access to a document, the query's `genres` field must include `Comedy`: + + ```json + { + "bool": { + "must": { + "match": { + "genres": "Comedy" + } + } + } + } + ``` + + - ![Document- and field-level security screen in OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/images/security-dls.png) + +## Updating roles by accessing the REST API In the REST API, you provide the query as a string, so you must escape your quotes. This role allows a user to read any document in any index with the field `public` set to `true`: diff --git a/_security/access-control/users-roles.md b/_security/access-control/users-roles.md index 3b728029f8..ae7670bc29 100644 --- a/_security/access-control/users-roles.md +++ b/_security/access-control/users-roles.md @@ -14,6 +14,23 @@ The Security plugin includes an internal user database. Use this database in pla Roles are the core way of controlling access to your cluster. Roles contain any combination of cluster-wide permissions, index-specific permissions, document- and field-level security, and tenants. Then you map users to these roles so that users gain those permissions. +## Creating and editing OpenSearch roles + +You can update OpenSearch by using one of the following methods. + +### Using the API + +You can send HTTP requests to OpenSearch-provided endpoints to update security roles, permissions, and associated settings. This method offers granular control and automation capabilities for managing roles. + +### Using the UI (OpenSearch Dashboards) + +OpenSearch Dashboards provides a user-friendly interface for managing roles. Roles, permissions, and document-level security settings are configured in the Security section within OpenSearch Dashboards. 
When updating roles through the UI, OpenSearch Dashboards calls the API in the background to implement the changes. + +### Editing the `roles.yml` file + +If you want more granular control of your security configuration, you can edit roles and their associated permissions in the `roles.yml` file. This method provides direct access to the underlying configuration and can be version controlled for use in collaborative development environments. +For more information about creating roles, see the [Create roles][https://opensearch.org/docs/latest/security/access-control/users-roles/#create-roles) documentation. + Unless you need to create new [reserved or hidden users]({{site.url}}{{site.baseurl}}/security/access-control/api/#reserved-and-hidden-resources), we **highly** recommend using OpenSearch Dashboards or the REST API to create new users, roles, and role mappings. The `.yml` files are for initial setup, not ongoing use. {: .warning } @@ -75,6 +92,24 @@ See [YAML files]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#roles See [Create role]({{site.url}}{{site.baseurl}}/security/access-control/api/#create-role). +## Edit roles + +You can edit roles using one of the following methods. + +### OpenSearch Dashboards + +1. Choose **Security** > **Roles**. In the **Create role** section, select **Explore existing roles**. +1. Select the role you want to edit. +1. Choose **edit role**. Make any necessary updates to the role. +1. To save your changes, select **Update**. + +### roles.yml + +See [YAML files]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#rolesyml). + +### REST API + +See [Create role]({{site.url}}{{site.baseurl}}/security/access-control/api/#create-role). ## Map users to roles From 4c507b5f5fc34de15f086067892bf11ef0d182d7 Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Thu, 28 Mar 2024 17:31:45 -0400 Subject: [PATCH 15/18] Remove experimental feature labels and flags for OS Assistant (#6745) * Remove experimental feature labels and flags from OS Assistant-related pages Signed-off-by: Fanit Kolchina * Change generally available to introduced Signed-off-by: Fanit Kolchina * Remove ML settings for OS Assistant Signed-off-by: Fanit Kolchina * Query assistant is enabled by default Signed-off-by: Fanit Kolchina * Revised per SME comments Signed-off-by: Fanit Kolchina --------- Signed-off-by: Fanit Kolchina --- _automating-configurations/index.md | 2 +- _dashboards/dashboards-assistant/index.md | 8 +------ .../agents-tools/agents-tools-tutorial.md | 5 +--- _ml-commons-plugin/agents-tools/index.md | 23 +------------------ .../agents-tools/tools/agent-tool.md | 5 +--- .../agents-tools/tools/cat-index-tool.md | 5 +--- .../agents-tools/tools/index-mapping-tool.md | 5 +--- .../agents-tools/tools/index.md | 2 +- .../agents-tools/tools/ml-model-tool.md | 5 +--- .../agents-tools/tools/neural-sparse-tool.md | 5 +--- .../agents-tools/tools/ppl-tool.md | 5 +--- .../agents-tools/tools/rag-tool.md | 5 +--- .../agents-tools/tools/search-alerts-tool.md | 5 +--- .../tools/search-anomaly-detectors.md | 5 +--- .../tools/search-anomaly-results.md | 5 +--- .../agents-tools/tools/search-index-tool.md | 5 +--- .../tools/search-monitors-tool.md | 5 +--- .../agents-tools/tools/vector-db-tool.md | 5 +--- .../agents-tools/tools/visualization-tool.md | 5 +--- .../api/agent-apis/delete-agent.md | 5 +--- .../api/agent-apis/execute-agent.md | 5 +--- .../api/agent-apis/get-agent.md | 5 +--- _ml-commons-plugin/api/agent-apis/index.md | 5 +--- 
.../api/agent-apis/register-agent.md | 5 +--- .../api/agent-apis/search-agent.md | 5 +--- _ml-commons-plugin/custom-local-models.md | 2 +- _ml-commons-plugin/ml-dashboard.md | 2 +- _ml-commons-plugin/opensearch-assistant.md | 7 +----- _ml-commons-plugin/pretrained-models.md | 2 +- _ml-commons-plugin/using-ml-models.md | 2 +- _observing-your-data/event-analytics.md | 18 ++++----------- 31 files changed, 35 insertions(+), 138 deletions(-) diff --git a/_automating-configurations/index.md b/_automating-configurations/index.md index a7462ad16a..ef9cb4f850 100644 --- a/_automating-configurations/index.md +++ b/_automating-configurations/index.md @@ -8,7 +8,7 @@ redirect_from: /automating-configurations/ --- # Automating configurations -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } You can automate complex OpenSearch setup and preprocessing tasks by providing templates for common use cases. For example, automating machine learning (ML) setup tasks streamlines the use of OpenSearch ML offerings. diff --git a/_dashboards/dashboards-assistant/index.md b/_dashboards/dashboards-assistant/index.md index 9313dd2e97..d44e6b58e8 100644 --- a/_dashboards/dashboards-assistant/index.md +++ b/_dashboards/dashboards-assistant/index.md @@ -6,14 +6,11 @@ has_children: false has_toc: false --- -This is an experimental feature and is not recommended for use in a production environment. For updates on the feature's progress or to leave feedback, go to the [`dashboards-assistant` repository](https://github.com/opensearch-project/dashboards-assistant) on GitHub or the associated [OpenSearch forum thread](https://forum.opensearch.org/t/feedback-opensearch-assistant/16741). -{: .warning} - Note that machine learning models are probabilistic and that some may perform better than others, so the OpenSearch Assistant may occasionally produce inaccurate information. We recommend evaluating outputs for accuracy as appropriate to your use case, including reviewing the output or combining it with other verification factors. {: .important} # OpenSearch Assistant for OpenSearch Dashboards -Introduced 2.12 +**Introduced 2.13** {: .label .label-purple } The OpenSearch Assistant toolkit helps you create AI-powered assistants for OpenSearch Dashboards without requiring you to have specialized query tools or skills. @@ -49,9 +46,6 @@ A screenshot of the interface is shown in the following image. OpenSearch Assistant interface -For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). -{: .note} - ## Configuring OpenSearch Assistant You can use the OpenSearch Dashboards interface to configure OpenSearch Assistant. Go to the [Getting started guide](https://github.com/opensearch-project/dashboards-assistant/blob/main/GETTING_STARTED_GUIDE.md) for step-by-step instructions. For the chatbot template, go to the [Flow Framework plugin](https://github.com/opensearch-project/flow-framework) documentation. You can modify this template to use your own model and customize the chatbot tools. 
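As a minimal sketch of the Dashboards-side setup that the getting started guide walks through, the assistant chat interface is typically switched on in `opensearch_dashboards.yml`. The `assistant.chat.enabled` key shown here is an assumption based on that guide; confirm the exact setting name for your version.

```yaml
# opensearch_dashboards.yml
# Assumed flag from the getting started guide: turns on the OpenSearch Assistant chat UI.
assistant.chat.enabled: true
```

After restarting OpenSearch Dashboards, the chat interface should become available, provided the backing agents are configured.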
diff --git a/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md b/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md index 109cbf8836..68d979d6d6 100644 --- a/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md +++ b/_ml-commons-plugin/agents-tools/agents-tools-tutorial.md @@ -7,12 +7,9 @@ nav_order: 10 --- # Agents and tools tutorial -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The following tutorial illustrates creating a flow agent for retrieval-augmented generation (RAG). A flow agent runs its configured tools sequentially, in the order specified. In this example, you'll create an agent with two tools: 1. `VectorDBTool`: The agent will use this tool to retrieve OpenSearch documents relevant to the user question. You'll ingest supplementary information into an OpenSearch index. To facilitate vector search, you'll deploy a text embedding model that translates text into vector embeddings. OpenSearch will translate the ingested documents into embeddings and store them in the index. When you provide a user question to the agent, the agent will construct a query from the question, run vector search on the OpenSearch index, and pass the relevant retrieved documents to the `MLModelTool`. diff --git a/_ml-commons-plugin/agents-tools/index.md b/_ml-commons-plugin/agents-tools/index.md index 016a077c62..ba88edef2f 100644 --- a/_ml-commons-plugin/agents-tools/index.md +++ b/_ml-commons-plugin/agents-tools/index.md @@ -7,12 +7,9 @@ nav_order: 27 --- # Agents and tools -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - You can automate machine learning (ML) tasks using agents and tools. An _agent_ orchestrates and runs ML models and tools. A _tool_ performs a set of specific tasks. Some examples of tools are the `VectorDBTool`, which supports vector search, and the `CATIndexTool`, which executes the `cat indices` operation. For a list of supported tools, see [Tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/tools/index/). ## Agents @@ -155,24 +152,6 @@ POST /_plugins/_ml/agents/_register It is important to provide thorough descriptions of the tools so that the LLM can decide in which situations to use those tools. {: .tip} -## Enabling the feature - -To enable agents and tools, configure the following setting: - -```yaml -plugins.ml_commons.agent_framework_enabled: true -``` -{% include copy.html %} - -For conversational agents, you also need to enable RAG for use in conversational search. To enable RAG, configure the following setting: - -```yaml -plugins.ml_commons.rag_pipeline_feature_enabled: true -``` -{% include copy.html %} - -For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). - ## Next steps - For a list of supported tools, see [Tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/tools/index/). 
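To make the flow-agent pattern above concrete, a minimal registration sketch that chains a `VectorDBTool` with an `MLModelTool` is shown below. The model IDs, index name, and parameter values are placeholders, and the parameter names are assumptions based on the tool pages referenced in this patch; substitute your own deployed embedding model, LLM, and index.

```json
POST /_plugins/_ml/agents/_register
{
  "name": "Demo_Flow_Agent_For_RAG",
  "type": "flow",
  "description": "Sketch of a flow agent that retrieves documents and passes them to an LLM",
  "tools": [
    {
      "type": "VectorDBTool",
      "parameters": {
        "model_id": "<your text embedding model ID>",
        "index": "my_knowledge_base",
        "embedding_field": "embedding",
        "source_field": ["text"],
        "input": "${parameters.question}"
      }
    },
    {
      "type": "MLModelTool",
      "description": "Answers the question using the retrieved context",
      "parameters": {
        "model_id": "<your LLM model ID>",
        "prompt": "Context:\n${parameters.VectorDBTool.output}\n\nAnswer the question using only this context.\n\nQuestion: ${parameters.question}"
      }
    }
  ]
}
```

Because the agent type is `flow`, the two tools run sequentially: the retrieval output of `VectorDBTool` is interpolated into the `MLModelTool` prompt through the `${parameters.VectorDBTool.output}` placeholder.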
diff --git a/_ml-commons-plugin/agents-tools/tools/agent-tool.md b/_ml-commons-plugin/agents-tools/tools/agent-tool.md index 272456d693..272af51e4d 100644 --- a/_ml-commons-plugin/agents-tools/tools/agent-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/agent-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Agent tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `AgentTool` runs any agent. ## Step 1: Set up an agent for AgentTool to run diff --git a/_ml-commons-plugin/agents-tools/tools/cat-index-tool.md b/_ml-commons-plugin/agents-tools/tools/cat-index-tool.md index 77b28ed527..50ccf28b9b 100644 --- a/_ml-commons-plugin/agents-tools/tools/cat-index-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/cat-index-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # CAT Index tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `CatIndexTool` retrieves index information for the OpenSearch cluster, similarly to the [CAT Indices API]({{site.url}}{{site.baseurl}}/api-reference/cat/cat-indices/). ## Step 1: Register a flow agent that will run the CatIndexTool diff --git a/_ml-commons-plugin/agents-tools/tools/index-mapping-tool.md b/_ml-commons-plugin/agents-tools/tools/index-mapping-tool.md index f27b0592a8..8649d2d74d 100644 --- a/_ml-commons-plugin/agents-tools/tools/index-mapping-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/index-mapping-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Index Mapping tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `IndexMappingTool` retrieves mapping and setting information for indexes in your cluster. ## Step 1: Register a flow agent that will run the IndexMappingTool diff --git a/_ml-commons-plugin/agents-tools/tools/index.md b/_ml-commons-plugin/agents-tools/tools/index.md index fe6d574d63..8db522006e 100644 --- a/_ml-commons-plugin/agents-tools/tools/index.md +++ b/_ml-commons-plugin/agents-tools/tools/index.md @@ -10,7 +10,7 @@ redirect_from: --- # Tools -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } A _tool_ performs a set of specific tasks. The following table lists all tools that OpenSearch supports. 
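As a brief sketch of how a single tool from this list is wired into an agent, the following request registers a flow agent that runs only the `CatIndexTool`. The agent name and the `input` parameter below are illustrative assumptions rather than values from this patch.

```json
POST /_plugins/_ml/agents/_register
{
  "name": "Demo_Agent_For_CatIndexTool",
  "type": "flow",
  "description": "Sketch of a flow agent that lists cluster indexes via CatIndexTool",
  "tools": [
    {
      "type": "CatIndexTool",
      "name": "DemoCatIndexTool",
      "parameters": {
        "input": "${parameters.question}"
      }
    }
  ]
}
```

Executing the registered agent with a question then returns index information similar to the CAT Indices API output.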
diff --git a/_ml-commons-plugin/agents-tools/tools/ml-model-tool.md b/_ml-commons-plugin/agents-tools/tools/ml-model-tool.md index c0f8aeab86..ceeda40528 100644 --- a/_ml-commons-plugin/agents-tools/tools/ml-model-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/ml-model-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # ML Model tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `MLModelTool` runs a machine learning (ML) model and returns inference results. ## Step 1: Create a connector for a model diff --git a/_ml-commons-plugin/agents-tools/tools/neural-sparse-tool.md b/_ml-commons-plugin/agents-tools/tools/neural-sparse-tool.md index bc1fd4845e..9fee4dcbd2 100644 --- a/_ml-commons-plugin/agents-tools/tools/neural-sparse-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/neural-sparse-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Neural Sparse Search tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `NeuralSparseSearchTool` performs sparse vector retrieval. For more information about neural sparse search, see [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/). ## Step 1: Register and deploy a sparse encoding model diff --git a/_ml-commons-plugin/agents-tools/tools/ppl-tool.md b/_ml-commons-plugin/agents-tools/tools/ppl-tool.md index f153ca88f3..72d8ba30b5 100644 --- a/_ml-commons-plugin/agents-tools/tools/ppl-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/ppl-tool.md @@ -9,12 +9,9 @@ grand_parent: Agents and tools --- # PPL tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `PPLTool` translates natural language into a PPL query. The tool provides an `execute` flag to specify whether to run the query. If you set the flag to `true`, the `PPLTool` runs the query and returns the query and the results. ## Prerequisite diff --git a/_ml-commons-plugin/agents-tools/tools/rag-tool.md b/_ml-commons-plugin/agents-tools/tools/rag-tool.md index ae3ad1281a..1f6fafe49a 100644 --- a/_ml-commons-plugin/agents-tools/tools/rag-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/rag-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # RAG tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `RAGTool` performs retrieval-augmented generation (RAG).
For more information about RAG, see [Conversational search]({{site.url}}{{site.baseurl}}/search-plugins/conversational-search/). RAG calls a large language model (LLM) and supplements its knowledge by providing relevant OpenSearch documents along with the user question. To retrieve relevant documents from an OpenSearch index, you'll need a text embedding model that facilitates vector search. diff --git a/_ml-commons-plugin/agents-tools/tools/search-alerts-tool.md b/_ml-commons-plugin/agents-tools/tools/search-alerts-tool.md index 387ef1cbab..76f9e4b4dc 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-alerts-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/search-alerts-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Alerts tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchAlertsTool` retrieves information about generated alerts. For more information about alerts, see [Alerting]({{site.url}}{{site.baseurl}}/observing-your-data/alerting/index/). ## Step 1: Register a flow agent that will run the SearchAlertsTool diff --git a/_ml-commons-plugin/agents-tools/tools/search-anomaly-detectors.md b/_ml-commons-plugin/agents-tools/tools/search-anomaly-detectors.md index de93a404a3..9f31dea057 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-anomaly-detectors.md +++ b/_ml-commons-plugin/agents-tools/tools/search-anomaly-detectors.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Anomaly Detectors tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchAnomalyDetectorsTool` retrieves information about anomaly detectors set up on your cluster. For more information about anomaly detectors, see [Anomaly detection]({{site.url}}{{site.baseurl}}/observing-your-data/ad/index/). ## Step 1: Register a flow agent that will run the SearchAnomalyDetectorsTool diff --git a/_ml-commons-plugin/agents-tools/tools/search-anomaly-results.md b/_ml-commons-plugin/agents-tools/tools/search-anomaly-results.md index bce27bba55..2f2728e32d 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-anomaly-results.md +++ b/_ml-commons-plugin/agents-tools/tools/search-anomaly-results.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Anomaly Results tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchAnomalyResultsTool` retrieves information about anomaly detector results. For more information about anomaly detectors, see [Anomaly detection]({{site.url}}{{site.baseurl}}/observing-your-data/ad/index/). 
## Step 1: Register a flow agent that will run the SearchAnomalyResultsTool diff --git a/_ml-commons-plugin/agents-tools/tools/search-index-tool.md b/_ml-commons-plugin/agents-tools/tools/search-index-tool.md index 86ecbfc609..b023522893 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-index-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/search-index-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Index tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchIndexTool` searches an index using a query written in query domain-specific language (DSL) and returns the query results. ## Step 1: Register a flow agent that will run the SearchIndexTool diff --git a/_ml-commons-plugin/agents-tools/tools/search-monitors-tool.md b/_ml-commons-plugin/agents-tools/tools/search-monitors-tool.md index 2b746d3453..77b51d4964 100644 --- a/_ml-commons-plugin/agents-tools/tools/search-monitors-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/search-monitors-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Search Monitors tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `SearchMonitorsTool` retrieves information about alerting monitors set up on your cluster. For more information about alerting monitors, see [Monitors]({{site.url}}{{site.baseurl}}/observing-your-data/alerting/monitors/). ## Step 1: Register a flow agent that will run the SearchMonitorsTool diff --git a/_ml-commons-plugin/agents-tools/tools/vector-db-tool.md b/_ml-commons-plugin/agents-tools/tools/vector-db-tool.md index d8b8083df3..9093541cbb 100644 --- a/_ml-commons-plugin/agents-tools/tools/vector-db-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/vector-db-tool.md @@ -10,13 +10,10 @@ grand_parent: Agents and tools # Vector DB tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - The `VectorDBTool` performs dense vector retrieval. For more information about OpenSearch vector database capabilities, see [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/). ## Step 1: Register and deploy a sparse encoding model diff --git a/_ml-commons-plugin/agents-tools/tools/visualization-tool.md b/_ml-commons-plugin/agents-tools/tools/visualization-tool.md index 1407232555..98457932c2 100644 --- a/_ml-commons-plugin/agents-tools/tools/visualization-tool.md +++ b/_ml-commons-plugin/agents-tools/tools/visualization-tool.md @@ -9,12 +9,9 @@ grand_parent: Agents and tools --- # Visualization tool -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. 
For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - Use the `VisualizationTool` to find visualizations relevant to a question. ## Step 1: Register a flow agent that will run the VisualizationTool diff --git a/_ml-commons-plugin/api/agent-apis/delete-agent.md b/_ml-commons-plugin/api/agent-apis/delete-agent.md index 0327c3bf04..ddde8fb19b 100644 --- a/_ml-commons-plugin/api/agent-apis/delete-agent.md +++ b/_ml-commons-plugin/api/agent-apis/delete-agent.md @@ -7,12 +7,9 @@ nav_order: 50 --- # Delete an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - You can use this API to delete an agent based on the `agent_id`. ## Path and HTTP methods diff --git a/_ml-commons-plugin/api/agent-apis/execute-agent.md b/_ml-commons-plugin/api/agent-apis/execute-agent.md index 8302ac265f..27d50bced0 100644 --- a/_ml-commons-plugin/api/agent-apis/execute-agent.md +++ b/_ml-commons-plugin/api/agent-apis/execute-agent.md @@ -7,12 +7,9 @@ nav_order: 20 --- # Execute an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - When an agent is executed, it runs the tools with which it is configured. ### Path and HTTP methods diff --git a/_ml-commons-plugin/api/agent-apis/get-agent.md b/_ml-commons-plugin/api/agent-apis/get-agent.md index be49a87502..6190406649 100644 --- a/_ml-commons-plugin/api/agent-apis/get-agent.md +++ b/_ml-commons-plugin/api/agent-apis/get-agent.md @@ -7,12 +7,9 @@ nav_order: 20 --- # Get an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - You can retrieve agent information using the `agent_id`. ## Path and HTTP methods diff --git a/_ml-commons-plugin/api/agent-apis/index.md b/_ml-commons-plugin/api/agent-apis/index.md index 4b6954a79f..72bf6082ce 100644 --- a/_ml-commons-plugin/api/agent-apis/index.md +++ b/_ml-commons-plugin/api/agent-apis/index.md @@ -9,12 +9,9 @@ redirect_from: /ml-commons-plugin/api/agent-apis/ --- # Agent APIs -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - You can automate machine learning (ML) tasks using agents and tools. An _agent_ orchestrates and runs ML models and tools. For more information, see [Agents and tools]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/index/). 
ML Commons supports the following agent-level APIs: diff --git a/_ml-commons-plugin/api/agent-apis/register-agent.md b/_ml-commons-plugin/api/agent-apis/register-agent.md index 75a63d40cf..820bb923f7 100644 --- a/_ml-commons-plugin/api/agent-apis/register-agent.md +++ b/_ml-commons-plugin/api/agent-apis/register-agent.md @@ -7,12 +7,9 @@ nav_order: 10 --- # Register an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - Use this API to register an agent. Agents may be of the following types: diff --git a/_ml-commons-plugin/api/agent-apis/search-agent.md b/_ml-commons-plugin/api/agent-apis/search-agent.md index c5df482ac2..3d950cde8f 100644 --- a/_ml-commons-plugin/api/agent-apis/search-agent.md +++ b/_ml-commons-plugin/api/agent-apis/search-agent.md @@ -7,12 +7,9 @@ nav_order: 30 --- # Search for an agent -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/ml-commons/issues/1161). -{: .warning} - Use this command to search for agents you've already created. You can provide any OpenSearch search query in the request body. ## Path and HTTP methods diff --git a/_ml-commons-plugin/custom-local-models.md b/_ml-commons-plugin/custom-local-models.md index ee44a0a529..a265d8804a 100644 --- a/_ml-commons-plugin/custom-local-models.md +++ b/_ml-commons-plugin/custom-local-models.md @@ -7,7 +7,7 @@ nav_order: 120 --- # Custom local models -**Generally available 2.9** +**Introduced 2.9** {: .label .label-purple } To use a custom model locally, you can upload it to the OpenSearch cluster. diff --git a/_ml-commons-plugin/ml-dashboard.md b/_ml-commons-plugin/ml-dashboard.md index 3195aff8de..20c4e636bb 100644 --- a/_ml-commons-plugin/ml-dashboard.md +++ b/_ml-commons-plugin/ml-dashboard.md @@ -7,7 +7,7 @@ redirect_from: --- # Managing ML models in OpenSearch Dashboards -**Generally available 2.9** +**Introduced 2.9** {: .label .label-purple } Administrators of machine learning (ML) clusters can use OpenSearch Dashboards to manage and check the status of ML models running inside a cluster. This can help ML developers provision nodes to ensure their models run efficiently. diff --git a/_ml-commons-plugin/opensearch-assistant.md b/_ml-commons-plugin/opensearch-assistant.md index 3a8e0c8703..0a058d73a0 100644 --- a/_ml-commons-plugin/opensearch-assistant.md +++ b/_ml-commons-plugin/opensearch-assistant.md @@ -7,12 +7,9 @@ nav_order: 28 --- # OpenSearch Assistant Toolkit -**Introduced 2.12** +**Introduced 2.13** {: .label .label-purple } -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [OpenSearch forum thread](https://forum.opensearch.org/t/feedback-opensearch-assistant/16741). -{: .warning} - The OpenSearch Assistant Toolkit helps you create AI-powered assistants for OpenSearch Dashboards. 
The toolkit includes the following elements: - [**Agents and tools**]({{site.url}}{{site.baseurl}}/ml-commons-plugin/agents-tools/index/): _Agents_ interface with a large language model (LLM) and execute high-level tasks, such as summarization or generating Piped Processing Language (PPL) queries from natural language. The agent's high-level tasks consist of low-level tasks called _tools_, which can be reused by multiple agents. @@ -36,8 +33,6 @@ To enable OpenSearch Assistant, perform the following steps: ``` {% include copy.html %} -For more information about ways to enable experimental features, see [Experimental feature flags]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). - ## Next steps - For more information about the OpenSearch Assistant UI, see [OpenSearch Assistant for OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/dashboards/dashboards-assistant/index/) \ No newline at end of file diff --git a/_ml-commons-plugin/pretrained-models.md b/_ml-commons-plugin/pretrained-models.md index c68f9c8bab..8847d36291 100644 --- a/_ml-commons-plugin/pretrained-models.md +++ b/_ml-commons-plugin/pretrained-models.md @@ -7,7 +7,7 @@ nav_order: 120 --- # OpenSearch-provided pretrained models -**Generally available 2.9** +**Introduced 2.9** {: .label .label-purple } OpenSearch provides a variety of open-source pretrained models that can assist with a range of machine learning (ML) search and analytics use cases. You can upload any supported model to the OpenSearch cluster and use it locally. diff --git a/_ml-commons-plugin/using-ml-models.md b/_ml-commons-plugin/using-ml-models.md index 5c23e19ab6..db50626721 100644 --- a/_ml-commons-plugin/using-ml-models.md +++ b/_ml-commons-plugin/using-ml-models.md @@ -10,7 +10,7 @@ redirect_from: --- # Using ML models within OpenSearch -**Generally available 2.9** +**Introduced 2.9** {: .label .label-purple } To integrate machine learning (ML) models into your OpenSearch cluster, you can upload and serve them locally. Choose one of the following options: diff --git a/_observing-your-data/event-analytics.md b/_observing-your-data/event-analytics.md index dd936b7d27..b8fe72964c 100644 --- a/_observing-your-data/event-analytics.md +++ b/_observing-your-data/event-analytics.md @@ -30,9 +30,6 @@ For more information about building PPL queries, see [Piped Processing Language] ### OpenSearch Dashboards Query Assistant -This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [OpenSearch forum thread](https://forum.opensearch.org/t/feedback-opensearch-assistant/16741). -{: .warning} - Note that machine learning models are probabilistic and that some may perform better than others, so the OpenSearch Assistant may occasionally produce inaccurate information. We recommend evaluating outputs for accuracy as appropriate to your use case, including reviewing the output or combining it with other verification factors. {: .important} @@ -42,28 +39,23 @@ To simplify query building, the **OpenSearch Assistant** toolkit offers an assis #### Enabling Query Assistant -To enable **Query Assistant** in OpenSearch Dashboards, locate your copy of the `opensearch_dashboards.yml` file and set the following option: - -``` -observability.query_assist.enabled: true -observability.query_assist.ppl_agent_name: "PPL agent" -``` +By default, **Query Assistant** is enabled in OpenSearch Dashboards. 
To enable summarization of responses, locate your copy of the `opensearch_dashboards.yml` file and set the following option: -To enable summarization of responses, locate your copy of the `opensearch_dashboards.yml` file and set the following option: - -``` +```yaml observability.summarize.enabled: true observability.summarize.response_summary_agent_name: "Response summary agent" observability.summarize.error_summary_agent_name: "Error summary agent" ``` +To disable Query Assistant, add `observability.query_assist.enabled: false` to your `opensearch_dashboards.yml`. + #### Setting up Query Assistant To set up **Query Assistant**, follow the steps in the [Getting started guide](https://github.com/opensearch-project/dashboards-assistant/blob/main/GETTING_STARTED_GUIDE.md) on GitHub. This guide provides step-by-step setup instructions for **OpenSearch Assistant** and **Query Assistant**. To set up **Query Assistant** only, use the `query-assist-agent` template included in the guide. ## Saving a visualization -After Dashboards generates a visualization, save it if you want to revisit it or include it in an [operational panel]({{site.url}}{{site.baseurl}}/observing-your-data/operational-panels). To save a visualization, expand the **Save** dropdown menu in the upper-right corner, enter a name for the visualization, and then select the **Save** button. You can reopen saved visualizations on the event analytics page. +After Dashboards generates a visualization, save it if you want to revisit it or include it in an [operational panel]({{site.url}}{{site.baseurl}}/observing-your-data/operational-panels/). To save a visualization, expand the **Save** dropdown menu in the upper-right corner, enter a name for the visualization, and then select the **Save** button. You can reopen saved visualizations on the event analytics page. 
## Creating event analytics visualizations and adding them to dashboards From 26c53abf4f1fa275eee1d20dbf6ec6d1c6b38a2a Mon Sep 17 00:00:00 2001 From: Jing Zhang Date: Thu, 28 Mar 2024 14:38:18 -0700 Subject: [PATCH 16/18] Add guardrails for remote model (#6750) * guardrails for remote model Signed-off-by: Jing Zhang * Doc review Signed-off-by: Fanit Kolchina * Add guardrails dedicated page Signed-off-by: Fanit Kolchina * Reword and reformat Signed-off-by: Fanit Kolchina * Add prerequisites Signed-off-by: Fanit Kolchina * Change example Signed-off-by: Fanit Kolchina * Add a link to query string query Signed-off-by: Fanit Kolchina * Add regex and responses Signed-off-by: Fanit Kolchina * Add a sentence about regex Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Apply suggestions from code review Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Add type to guardrails Signed-off-by: Fanit Kolchina --------- Signed-off-by: Jing Zhang Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Nathan Bower --- .../api/model-apis/register-model.md | 69 +++- .../api/model-apis/update-model.md | 33 +- .../remote-models/guardrails.md | 298 ++++++++++++++++++ _ml-commons-plugin/remote-models/index.md | 1 + 4 files changed, 398 insertions(+), 3 deletions(-) create mode 100644 _ml-commons-plugin/remote-models/guardrails.md diff --git a/_ml-commons-plugin/api/model-apis/register-model.md b/_ml-commons-plugin/api/model-apis/register-model.md index 880cbd68e5..dd157ed264 100644 --- a/_ml-commons-plugin/api/model-apis/register-model.md +++ b/_ml-commons-plugin/api/model-apis/register-model.md @@ -183,8 +183,9 @@ Field | Data type | Required/Optional | Description `description` | String | Optional| The model description. | `model_group_id` | String | Optional | The model group ID of the model group to register this model to. `is_enabled`| Boolean | Specifies whether the model is enabled. Disabling the model makes it unavailable for Predict API requests, regardless of the model's deployment status. Default is `true`. +`guardrails`| Object | Optional | The guardrails for the model input. For more information, see [Guardrails](#the-guardrails-parameter).| -#### Example request: Remote model with a standalone connector +#### Example request: Externally hosted with a standalone connector ```json POST /_plugins/_ml/models/_register @@ -198,7 +199,7 @@ POST /_plugins/_ml/models/_register ``` {% include copy-curl.html %} -#### Example request: Remote model with a connector specified as part of the model +#### Example request: Externally hosted with a connector specified as part of the model ```json POST /_plugins/_ml/models/_register @@ -248,6 +249,70 @@ OpenSearch responds with the `task_id` and task `status`. } ``` +### The `guardrails` parameter + +Guardrails are safety measures for large language models (LLMs). They provide a set of rules and boundaries that control how an LLM behaves and what kind of output it generates. + +To register an externally hosted model with guardrails, provide the `guardrails` parameter, which supports the following fields. All fields are optional. + +Field | Data type | Description +:--- | :--- | :--- +`type` | String | The guardrail type. 
Currently, only `local_regex` is supported. +`input_guardrail`| Object | The guardrail for the model input. | +`output_guardrail`| Object | The guardrail for the model output. | +`stop_words`| Object | The list of indexes containing stopwords used for the model input/output validation. If the model prompt/response contains a stopword contained in any of the indexes, the predict request on this model is rejected. | +`index_name`| Object | The name of the index storing the stopwords. | +`source_fields`| Object | The name of the field storing the stopwords. | +`regex`| Object | A regular expression used for input/output validation. If the model prompt/response matches the regular expression, the predict request on this model is rejected. | + +#### Example request: Externally hosted model with guardrails + +```json +POST /_plugins/_ml/models/_register +{ + "name": "openAI-gpt-3.5-turbo", + "function_name": "remote", + "model_group_id": "1jriBYsBq7EKuKzZX131", + "description": "test model", + "connector_id": "a1eMb4kBJ1eYAeTMAljY", + "guardrails": { + "type": "local_regex", + "input_guardrail": { + "stop_words": [ + { + "index_name": "stop_words_input", + "source_fields": ["title"] + } + ], + "regex": ["regex1", "regex2"] + }, + "output_guardrail": { + "stop_words": [ + { + "index_name": "stop_words_output", + "source_fields": ["title"] + } + ], + "regex": ["regex1", "regex2"] + } + } +} +``` +{% include copy-curl.html %} + +For a complete example, see [Guardrails]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/guardrails/). + +#### Example response + +OpenSearch responds with the `task_id` and task `status`: + +```json +{ + "task_id" : "ew8I44MBhyWuIwnfvDIH", + "status" : "CREATED" +} +``` + ## Check the status of model registration To see the status of your model registration and retrieve the model ID created for the new model version, pass the `task_id` as a path parameter to the Tasks API: diff --git a/_ml-commons-plugin/api/model-apis/update-model.md b/_ml-commons-plugin/api/model-apis/update-model.md index 380f422272..877d0b5c51 100644 --- a/_ml-commons-plugin/api/model-apis/update-model.md +++ b/_ml-commons-plugin/api/model-apis/update-model.md @@ -36,6 +36,7 @@ Field | Data type | Description `rate_limiter` | Object | Limits the number of times any user can call the Predict API on the model. For more information, see [Rate limiting inference calls]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#rate-limiting-inference-calls). `rate_limiter.limit` | Integer | The maximum number of times any user can call the Predict API on the model per `unit` of time. By default, there is no limit on the number of Predict API calls. Once you set a limit, you cannot reset it to no limit. As an alternative, you can specify a high limit value and a small time unit, for example, 1 request per nanosecond. `rate_limiter.unit` | String | The unit of time for the rate limiter. Valid values are `DAYS`, `HOURS`, `MICROSECONDS`, `MILLISECONDS`, `MINUTES`, `NANOSECONDS`, and `SECONDS`. +`guardrails`| Object | The guardrails for the model. 
#### Example request: Disabling a model @@ -62,6 +63,35 @@ PUT /_plugins/_ml/models/T_S-cY0BKCJ3ot9qr0aP ``` {% include copy-curl.html %} +#### Example request: Updating the guardrails + +```json +PUT /_plugins/_ml/models/MzcIJX8BA7mbufL6DOwl +{ + "guardrails": { + "input_guardrail": { + "stop_words": [ + { + "index_name": "updated_stop_words_input", + "source_fields": ["updated_title"] + } + ], + "regex": ["updated_regex1", "updated_regex2"] + }, + "output_guardrail": { + "stop_words": [ + { + "index_name": "updated_stop_words_output", + "source_fields": ["updated_title"] + } + ], + "regex": ["updated_regex1", "updated_regex2"] + } + } +} +``` +{% include copy-curl.html %} + #### Example response ```json @@ -78,4 +108,5 @@ PUT /_plugins/_ml/models/T_S-cY0BKCJ3ot9qr0aP "_seq_no": 48, "_primary_term": 4 } -``` \ No newline at end of file +``` + diff --git a/_ml-commons-plugin/remote-models/guardrails.md b/_ml-commons-plugin/remote-models/guardrails.md new file mode 100644 index 0000000000..ca34eb335c --- /dev/null +++ b/_ml-commons-plugin/remote-models/guardrails.md @@ -0,0 +1,298 @@ +--- +layout: default +title: Guardrails +has_children: false +has_toc: false +nav_order: 70 +parent: Connecting to externally hosted models +grand_parent: Integrating ML models +--- + +# Configuring model guardrails +**Introduced 2.13** +{: .label .label-purple } + +Guardrails can guide a large language model (LLM) toward desired behavior. They act as a filter, preventing the LLM from generating output that is harmful or violates ethical principles and facilitating safer use of AI. Guardrails also cause the LLM to produce more focused and relevant output. + +To configure guardrails for your LLM, you can provide a list of words to be prohibited in the input or output of the model. Alternatively, you can provide a regular expression against which the model input or output will be matched. + +## Prerequisites + +Before you start, make sure you have fulfilled the [prerequisites]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index/#prerequisites) for connecting to an externally hosted model. + +## Step 1: Create a guardrail index + +To start, create an index that will store the excluded words (_stopwords_). In the index settings, specify a `title` field, which will contain excluded words, and a `query` field of the [percolator]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/percolator/) type. The percolator query will be used to match the LLM input or output: + +```json +PUT /words0 +{ + "mappings": { + "properties": { + "title": { + "type": "text" + }, + "query": { + "type": "percolator" + } + } + } +} +``` +{% include copy-curl.html %} + +## Step 2: Index excluded words or phrases + +Next, index a query string query that will be used to match excluded words in the model input or output: + +```json +PUT /words0/_doc/1?refresh +{ + "query": { + "query_string": { + "query": "title: blacklist" + } + } +} +``` +{% include copy-curl.html %} + +```json +PUT /words0/_doc/2?refresh +{ + "query": { + "query_string": { + "query": "title: \"Master slave architecture\"" + } + } +} +``` +{% include copy-curl.html %} + +For more query string options, see [Query string query]({{site.url}}{{site.baseurl}}/query-dsl/full-text/query-string/). + +## Step 3: Register a model group + +To register a model group, send the following request: + +```json +POST /_plugins/_ml/model_groups/_register +{ + "name": "bedrock", + "description": "This is a public model group." 
+} +``` +{% include copy-curl.html %} + +The response contains the model group ID that you'll use to register a model to this model group: + +```json +{ + "model_group_id": "wlcnb4kBJ1eYAeTMHlV6", + "status": "CREATED" +} +``` + +To learn more about model groups, see [Model access control]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control/). + +## Step 4: Create a connector + +Now you can create a connector for the model. In this example, you'll create a connector to the Anthropic Claude model hosted on Amazon Bedrock: + +```json +POST /_plugins/_ml/connectors/_create +{ + "name": "BedRock test claude Connector", + "description": "The connector to BedRock service for claude model", + "version": 1, + "protocol": "aws_sigv4", + "parameters": { + "region": "us-east-1", + "service_name": "bedrock", + "anthropic_version": "bedrock-2023-05-31", + "endpoint": "bedrock.us-east-1.amazonaws.com", + "auth": "Sig_V4", + "content_type": "application/json", + "max_tokens_to_sample": 8000, + "temperature": 0.0001, + "response_filter": "$.completion" + }, + "credential": { + "access_key": "", + "secret_key": "" + }, + "actions": [ + { + "action_type": "predict", + "method": "POST", + "url": "https://bedrock-runtime.us-east-1.amazonaws.com/model/anthropic.claude-v2/invoke", + "headers": { + "content-type": "application/json", + "x-amz-content-sha256": "required" + }, + "request_body": "{\"prompt\":\"${parameters.prompt}\", \"max_tokens_to_sample\":${parameters.max_tokens_to_sample}, \"temperature\":${parameters.temperature}, \"anthropic_version\":\"${parameters.anthropic_version}\" }" + } + ] +} +``` +{% include copy-curl.html %} + +The response contains the connector ID for the newly created connector: + +```json +{ + "connector_id": "a1eMb4kBJ1eYAeTMAljY" +} +``` + +## Step 5: Register and deploy the model with guardrails + +To register an externally hosted model, provide the model group ID from step 3 and the connector ID from step 4 in the following request. To configure guardrails, include the `guardrails` object: + +```json +POST /_plugins/_ml/models/_register?deploy=true +{ + "name": "Bedrock Claude V2 model", + "function_name": "remote", + "model_group_id": "wlcnb4kBJ1eYAeTMHlV6", + "description": "test model", + "connector_id": "a1eMb4kBJ1eYAeTMAljY", + "guardrails": { + "type": "local_regex", + "input_guardrail": { + "stop_words": [ + { + "index_name": "words0", + "source_fields": [ + "title" + ] + } + ], + "regex": [ + ".*abort.*", + ".*kill.*" + ] + }, + "output_guardrail": { + "stop_words": [ + { + "index_name": "words0", + "source_fields": [ + "title" + ] + } + ], + "regex": [ + ".*abort.*", + ".*kill.*" + ] + } + } +} +``` +{% include copy-curl.html %} + +For more information, see [The `guardrails` parameter]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#the-guardrails-parameter). 
+ +OpenSearch returns the task ID of the register operation: + +```json +{ + "task_id": "cVeMb4kBJ1eYAeTMFFgj", + "status": "CREATED" +} +``` + +To check the status of the operation, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/): + +```bash +GET /_plugins/_ml/tasks/cVeMb4kBJ1eYAeTMFFgj +``` +{% include copy-curl.html %} + +When the operation is complete, the state changes to `COMPLETED`: + +```json +{ + "model_id": "cleMb4kBJ1eYAeTMFFg4", + "task_type": "DEPLOY_MODEL", + "function_name": "REMOTE", + "state": "COMPLETED", + "worker_node": [ + "n-72khvBTBi3bnIIR8FTTw" + ], + "create_time": 1689793851077, + "last_update_time": 1689793851101, + "is_async": true +} +``` + +## Step 6 (Optional): Test the model + +To demonstrate how guardrails are applied, first run the predict operation that does not contain any excluded words: + +```json +POST /_plugins/_ml/models/p94dYo4BrXGpZpgPp98E/_predict +{ + "parameters": { + "prompt": "\n\nHuman:this is a test\n\nnAssistant:" + } +} +``` +{% include copy-curl.html %} + +The response contains inference results: + +```json +{ + "inference_results": [ + { + "output": [ + { + "name": "response", + "dataAsMap": { + "response": " Thank you for the test, I appreciate you taking the time to interact with me. I'm an AI assistant created by Anthropic to be helpful, harmless, and honest." + } + } + ], + "status_code": 200 + } + ] +} +``` + +Then run the predict operation that contains excluded words: + +```json +POST /_plugins/_ml/models/p94dYo4BrXGpZpgPp98E/_predict +{ + "parameters": { + "prompt": "\n\nHuman:this is a test of Master slave architecture\n\nnAssistant:" + } +} +``` +{% include copy-curl.html %} + +The response contains an error message because guardrails were triggered: + +```json +{ + "error": { + "root_cause": [ + { + "type": "illegal_argument_exception", + "reason": "guardrails triggered for user input" + } + ], + "type": "illegal_argument_exception", + "reason": "guardrails triggered for user input" + }, + "status": 400 +} +``` + +Guardrails are also triggered when a prompt matches the supplied regular expression. + +## Next steps + +- For more information about configuring guardrails, see [The `guardrails` parameter]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/register-model/#the-guardrails-parameter). \ No newline at end of file diff --git a/_ml-commons-plugin/remote-models/index.md b/_ml-commons-plugin/remote-models/index.md index 657d7254be..0b92adaab6 100644 --- a/_ml-commons-plugin/remote-models/index.md +++ b/_ml-commons-plugin/remote-models/index.md @@ -328,3 +328,4 @@ To learn how to use the model for vector search, see [Using an ML model for neur - For more information about connector parameters, see [Connector blueprints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/). - For more information about managing ML models in OpenSearch, see [Using ML models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-serving-framework/). - For more information about interacting with ML models in OpenSearch, see [Managing ML models in OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-dashboard/) +For instructions on how to configure model guardrails, see [Guardrails]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/guardrails/). 
From 2e41a57809e22aad0d8a7da5ba6ef5c52fabba8e Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Thu, 28 Mar 2024 18:45:24 -0500 Subject: [PATCH 17/18] Fix table in S3 documentation (#6810) Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _data-prepper/pipelines/configuration/sinks/opensearch.md | 1 - 1 file changed, 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/sinks/opensearch.md b/_data-prepper/pipelines/configuration/sinks/opensearch.md index d485fbb2b9..628515a985 100644 --- a/_data-prepper/pipelines/configuration/sinks/opensearch.md +++ b/_data-prepper/pipelines/configuration/sinks/opensearch.md @@ -91,7 +91,6 @@ Option | Required | Type | Description `document_root_key` | No | String | The key in the event that will be used as the root in the document. The default is the root of the event. If the key does not exist, then the entire event is written as the document. If `document_root_key` is of a basic value type, such as a string or integer, then the document will have a structure of `{"data": }`. `serverless` | No | Boolean | Determines whether the OpenSearch backend is Amazon OpenSearch Serverless. Set this value to `true` when the destination for the `opensearch` sink is an Amazon OpenSearch Serverless collection. Default is `false`. `serverless_options` | No | Object | The network configuration options available when the backend of the `opensearch` sink is set to Amazon OpenSearch Serverless. For more information, see [Serverless options](#serverless-options). - ## aws From 5d9edcbbeb6adc7e18a1b6de4735bfbb6e1432b2 Mon Sep 17 00:00:00 2001 From: Naveen Tatikonda Date: Fri, 29 Mar 2024 10:47:27 -0500 Subject: [PATCH 18/18] Add documentation for k-NN Faiss SQfp16 (#6249) * Add Documentation for k-NN Faiss SQFP16 Signed-off-by: Naveen Tatikonda * Address Review Comments Signed-off-by: Naveen Tatikonda * Doc review Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Doc review Signed-off-by: Fanit Kolchina * Add sentence to choosing the right method Signed-off-by: Fanit Kolchina * Update _search-plugins/knn/knn-index.md Co-authored-by: Naveen Tatikonda Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Tech review comments Signed-off-by: Fanit Kolchina * Update _search-plugins/knn/knn-vector-quantization.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Add note about SIMD Signed-off-by: Fanit Kolchina * Reworded recall loss Signed-off-by: Fanit Kolchina * Reword according to tech review feedback Signed-off-by: Fanit Kolchina * Tech review comment Signed-off-by: Fanit Kolchina * Add warning about Windows Signed-off-by: Fanit Kolchina * Tech review comments Signed-off-by: Fanit Kolchina * Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Define IVF Signed-off-by: Fanit Kolchina * Update _search-plugins/knn/knn-index.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> * Update _search-plugins/knn/knn-index.md Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --------- Signed-off-by: Naveen Tatikonda Signed-off-by: Fanit Kolchina Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Co-authored-by: Fanit Kolchina Co-authored-by: kolchfa-aws 
<105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower
---
 .../styles/Vocab/OpenSearch/Words/accept.txt  |   2 +
 _search-plugins/knn/knn-index.md              | 152 ++++++++++++++---
 .../knn/knn-vector-quantization.md            | 159 ++++++++++++++++++
 _search-plugins/knn/settings.md               |   1 +
 4 files changed, 291 insertions(+), 23 deletions(-)
 create mode 100644 _search-plugins/knn/knn-vector-quantization.md

diff --git a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt
index 091f2d2534..4362c11798 100644
--- a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt
+++ b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt
@@ -81,6 +81,7 @@ Levenshtein
 [Oo]nboarding
 pebibyte
 [Pp]erformant
+[Pp]laintext
 [Pp]luggable
 [Pp]reconfigure
 [Pp]refetch
@@ -92,6 +93,7 @@ pebibyte
 [Pp]reprocess
 [Pp]retrain
 [Pp]seudocode
+[Qq]uantiz(e|ation|ing|er)
 [Rr]ebalance
 [Rr]ebalancing
 [Rr]edownload
diff --git a/_search-plugins/knn/knn-index.md b/_search-plugins/knn/knn-index.md
index 1e0c2e84f5..01b82b425b 100644
--- a/_search-plugins/knn/knn-index.md
+++ b/_search-plugins/knn/knn-index.md
@@ -11,10 +11,65 @@ has_children: false
 
 The k-NN plugin introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. For more information, see [k-NN vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/).
 
+To create a k-NN index, set the `settings.index.knn` parameter to `true`:
+
+```json
+PUT /test-index
+{
+  "settings": {
+    "index": {
+      "knn": true
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector1": {
+        "type": "knn_vector",
+        "dimension": 3,
+        "method": {
+          "name": "hnsw",
+          "space_type": "l2",
+          "engine": "lucene",
+          "parameters": {
+            "ef_construction": 128,
+            "m": 24
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
 ## Lucene byte vector
 
 Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
 
+## SIMD optimization for the Faiss engine
+
+Starting with version 2.13, the k-NN plugin supports [Single Instruction Multiple Data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) processing if the underlying hardware supports SIMD instructions (AVX2 on x64 architecture and Neon on ARM64 architecture). SIMD is supported by default only on Linux machines and only for the Faiss engine. SIMD architecture helps boost overall performance by improving indexing throughput and reducing search latency.
+
+SIMD optimization is applicable only if the vector dimension is a multiple of 8.
+{: .note}
+
+
+### x64 architecture
+
+
+For the x64 architecture, two different versions of the Faiss library are built and shipped with the artifact:
+
+- `libopensearchknn_faiss.so`: The non-optimized Faiss library without SIMD instructions.
+- `libopensearchknn_faiss_avx2.so`: The Faiss library that contains AVX2 SIMD instructions.
+
+If your hardware supports AVX2, the k-NN plugin loads the `libopensearchknn_faiss_avx2.so` library at runtime.
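A quick way to confirm whether an x64 Linux host actually advertises the AVX2 support described in the preceding paragraph is to inspect `/proc/cpuinfo`. This is a generic Linux check offered as a sketch, not an OpenSearch or k-NN plugin command:

```bash
# Prints "avx2" if the CPU advertises AVX2 support; prints nothing otherwise.
# On hosts without AVX2, the k-NN plugin is expected to load the non-optimized
# libopensearchknn_faiss.so library instead.
grep -m1 -o 'avx2' /proc/cpuinfo
```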
+ +To disable AVX2 and load the non-optimized Faiss library (`libopensearchknn_faiss.so`), specify the `knn.faiss.avx2.disabled` static setting as `true` in `opensearch.yml` (default is `false`). Note that to update a static setting, you must stop the cluster, change the setting, and restart the cluster. For more information, see [Static settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/#static-settings). + +### ARM64 architecture + +For the ARM64 architecture, only one performance-boosting Faiss library (`libopensearchknn_faiss.so`) is built and shipped. The library contains Neon SIMD instructions and cannot be disabled. + ## Method definitions A method definition refers to the underlying configuration of the approximate k-NN algorithm you want to use. Method definitions are used to either create a `knn_vector` field (when the method does not require training) or [create a model during training]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model) that can then be used to [create a `knn_vector` field]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model). @@ -48,12 +103,12 @@ For nmslib, *ef_search* is set in the [index settings](#index-settings). An index created in OpenSearch version 2.11 or earlier will still use the old `ef_construction` value (`512`). {: .note} -### Supported faiss methods +### Supported Faiss methods Method name | Requires training | Supported spaces | Description :--- | :--- | :--- | :--- `hnsw` | false | l2, innerproduct | Hierarchical proximity graph approach to approximate k-NN search. -`ivf` | true | l2, innerproduct | Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched. +`ivf` | true | l2, innerproduct | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched. For hnsw, "innerproduct" is not available when PQ is used. {: .note} @@ -107,25 +162,21 @@ An index created in OpenSearch version 2.11 or earlier will still use the old `e {: .note} ```json -{ - "type": "knn_vector", - "dimension": 100, - "method": { - "name":"hnsw", - "engine":"lucene", - "space_type": "l2", - "parameters":{ - "m":2048, - "ef_construction": 245 - } +"method": { + "name":"hnsw", + "engine":"lucene", + "space_type": "l2", + "parameters":{ + "m":2048, + "ef_construction": 245 } } ``` -### Supported faiss encoders +### Supported Faiss encoders -You can use encoders to reduce the memory footprint of a k-NN index at the expense of search accuracy. faiss has -several encoder types, but the plugin currently only supports *flat* and *pq* encoding. +You can use encoders to reduce the memory footprint of a k-NN index at the expense of search accuracy. The k-NN plugin currently supports the +`flat`, `pq`, and `sq` encoders in the Faiss library. The following example method definition specifies the `hnsw` method and a `pq` encoder: @@ -151,11 +202,27 @@ The `hnsw` method supports the `pq` encoder for OpenSearch versions 2.10 and lat Encoder name | Requires training | Description :--- | :--- | :--- -`flat` | false | Encode vectors as floating point arrays. This encoding does not reduce memory footprint. +`flat` (Default) | false | Encode vectors as floating-point arrays. This encoding does not reduce memory footprint. 
`pq` | true | An abbreviation for _product quantization_, it is a lossy compression technique that uses clustering to encode a vector into a fixed size of bytes, with the goal of minimizing the drop in k-NN search accuracy. At a high level, vectors are broken up into `m` subvectors, and then each subvector is represented by a `code_size` code obtained from a code book produced during training. For more information about product quantization, see [this blog post](https://medium.com/dotstar/understanding-faiss-part-2-79d90b1e5388). +`sq` | false | An abbreviation for _scalar quantization_. Starting with k-NN plugin version 2.13, you can use the `sq` encoder to quantize 32-bit floating-point vectors into 16-bit floats. In version 2.13, the built-in `sq` encoder is the SQFP16 Faiss encoder. The encoder reduces memory footprint with a minimal loss of precision and improves performance by using SIMD optimization (using AVX2 on x86 architecture or Neon on ARM64 architecture). For more information, see [Faiss scalar quantization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization#faiss-scalar-quantization). -#### Examples +#### PQ parameters + +Parameter name | Required | Default | Updatable | Description +:--- | :--- | :--- | :--- | :--- +`m` | false | 1 | false | Determines the number of subvectors into which to break the vector. Subvectors are encoded independently of each other. This vector dimension must be divisible by `m`. Maximum value is 1,024. +`code_size` | false | 8 | false | Determines the number of bits into which to encode a subvector. Maximum value is 8. For IVF, this value must be less than or equal to 8. For HNSW, this value can only be 8. + +#### SQ parameters + +Parameter name | Required | Default | Updatable | Description +:--- | :--- | :-- | :--- | :--- +`type` | false | `fp16` | false | The type of scalar quantization to be used to encode 32-bit float vectors into the corresponding type. As of OpenSearch 2.13, only the `fp16` encoder type is supported. For the `fp16` encoder, vector values must be in the [-65504.0, 65504.0] range. +`clip` | false | `false` | false | If `true`, then any vector values outside of the supported range for the specified vector type are rounded so that they are in the range. If `false`, then the request is rejected if any vector values are outside of the supported range. Setting `clip` to `true` may decrease recall. + +For more information and examples, see [Using Faiss scalar quantization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/#using-faiss-scalar-quantization). +#### Examples The following example uses the `ivf` method without specifying an encoder (by default, OpenSearch uses the `flat` encoder): @@ -204,12 +271,46 @@ The following example uses the `hnsw` method without specifying an encoder (by d } ``` -#### PQ parameters +The following example uses the `hnsw` method with an `sq` encoder of type `fp16` with `clip` enabled: -Paramater Name | Required | Default | Updatable | Description -:--- | :--- | :--- | :--- | :--- -`m` | false | 1 | false | Determines the number of subvectors into which to break the vector. Subvectors are encoded independently of each other. This dimension of the vector must be divisible by `m`. Maximum value is 1,024. -`code_size` | false | 8 | false | Determines the number of bits into which to encode a subvector. Maximum value is 8. For IVF, this value must be less than or equal to 8. For HNSW, this value can only be 8. 
+```json
+"method": {
+  "name":"hnsw",
+  "engine":"faiss",
+  "space_type": "l2",
+  "parameters":{
+    "encoder": {
+      "name": "sq",
+      "parameters": {
+        "type": "fp16",
+        "clip": true
+      }
+    },
+    "ef_construction": 256,
+    "m": 8
+  }
+}
+```
+
+The following example uses the `ivf` method with an `sq` encoder of type `fp16`:
+
+```json
+"method": {
+  "name":"ivf",
+  "engine":"faiss",
+  "space_type": "l2",
+  "parameters":{
+    "encoder": {
+      "name": "sq",
+      "parameters": {
+        "type": "fp16",
+        "clip": false
+      }
+    },
+    "nprobes": 2
+  }
+}
+```
 
 ### Choosing the right method
@@ -221,6 +322,8 @@ If you want to use less memory and index faster than HNSW, while maintaining sim
 
 If memory is a concern, consider adding a PQ encoder to your HNSW or IVF index. Because PQ is a lossy encoding, query quality will drop.
 
+You can reduce the memory footprint by a factor of 2, with a minimal loss in search quality, by using the [`fp16` encoder]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/#faiss-scalar-quantization). If your vector values are within the [-128, 127] byte range, we recommend using the [byte quantizer]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#lucene-byte-vector) in order to reduce the memory footprint by a factor of 4. To learn more about vector quantization options, see [k-NN vector quantization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/).
+
 ### Memory estimation
 
 In a typical OpenSearch cluster, a certain portion of RAM is set aside for the JVM heap. The k-NN plugin allocates
@@ -230,6 +333,9 @@ the `circuit_breaker_limit` cluster setting. By default, the limit is set at 50%
 Having a replica doubles the total number of vectors.
 {: .note }
 
+For information about using memory estimation with vector quantization, see the [vector quantization documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/#memory-estimation).
+{: .note }
+
 #### HNSW memory estimation
 
 The memory required for HNSW is estimated to be `1.1 * (4 * dimension + 8 * M)` bytes/vector.
diff --git a/_search-plugins/knn/knn-vector-quantization.md b/_search-plugins/knn/knn-vector-quantization.md
new file mode 100644
index 0000000000..3373f104c2
--- /dev/null
+++ b/_search-plugins/knn/knn-vector-quantization.md
@@ -0,0 +1,159 @@
+---
+layout: default
+title: k-NN vector quantization
+nav_order: 27
+parent: k-NN search
+grand_parent: Search methods
+has_children: false
+has_math: true
+---
+
+# k-NN vector quantization
+
+By default, the k-NN plugin supports the indexing and querying of vectors of type `float`, where each dimension of the vector occupies 4 bytes of memory. For use cases that require ingestion on a large scale, keeping `float` vectors can be expensive because OpenSearch needs to construct, load, save, and search graphs (for native `nmslib` and `faiss` engines). To reduce the memory footprint, you can use vector quantization.
+
+## Lucene byte vector
+
+Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to reduce the amount of required memory. This requires quantizing the vectors outside of OpenSearch before ingesting them into an OpenSearch index. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
+
+## Faiss scalar quantization
+
+Starting with version 2.13, the k-NN plugin supports performing scalar quantization for the Faiss engine within OpenSearch.
Within the Faiss engine, a scalar quantizer (SQfp16) performs the conversion between 32-bit and 16-bit vectors. At ingestion time, when you upload 32-bit floating-point vectors to OpenSearch, SQfp16 quantizes them into 16-bit floating-point vectors and stores the quantized vectors in a k-NN index. At search time, SQfp16 decodes the vector values back into 32-bit floating-point values for distance computation. The SQfp16 quantization can decrease the memory footprint by a factor of 2. Additionally, it leads to a minimal loss in recall when differences between vector values are large compared to the error introduced by eliminating their two least significant bits. When used with [SIMD optimization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#simd-optimization-for-the-faiss-engine), SQfp16 quantization can also significantly reduce search latencies and improve indexing throughput.
+
+SIMD optimization is not supported on Windows. Using Faiss scalar quantization on Windows can lead to a significant drop in performance, including decreased indexing throughput and increased search latencies.
+{: .warning}
+
+### Using Faiss scalar quantization
+
+To use Faiss scalar quantization, set the k-NN vector field's `method.parameters.encoder.name` to `sq` when creating a k-NN index:
+
+```json
+PUT /test-index
+{
+  "settings": {
+    "index": {
+      "knn": true,
+      "knn.algo_param.ef_search": 100
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector1": {
+        "type": "knn_vector",
+        "dimension": 3,
+        "method": {
+          "name": "hnsw",
+          "engine": "faiss",
+          "space_type": "l2",
+          "parameters": {
+            "encoder": {
+              "name": "sq"
+            },
+            "ef_construction": 256,
+            "m": 8
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Optionally, you can specify the encoder parameters in `method.parameters.encoder`. For more information about `encoder` object parameters, see [SQ parameters]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#sq-parameters).
+
+The `fp16` encoder converts 32-bit vectors into their 16-bit counterparts. For this encoder type, the vector values must be in the [-65504.0, 65504.0] range. The `clip` parameter defines how out-of-range values are handled. By default, this parameter is `false`, and any vectors containing out-of-range values are rejected. When `clip` is set to `true`, out-of-range vector values are rounded up or down so that they are in the supported range. For example, if the original 32-bit vector is `[65510.82, -65504.1]`, the vector will be indexed as a 16-bit vector `[65504.0, -65504.0]`.
+
+We recommend setting `clip` to `true` only if very few elements lie outside of the supported range. Rounding the values may cause a drop in recall.
+{: .note}
+
+The following example method definition specifies the Faiss SQfp16 encoder, which rejects any indexing request that contains out-of-range vector values (because the `clip` parameter is `false` by default):
+
+```json
+PUT /test-index
+{
+  "settings": {
+    "index": {
+      "knn": true,
+      "knn.algo_param.ef_search": 100
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector1": {
+        "type": "knn_vector",
+        "dimension": 3,
+        "method": {
+          "name": "hnsw",
+          "engine": "faiss",
+          "space_type": "l2",
+          "parameters": {
+            "encoder": {
+              "name": "sq",
+              "parameters": {
+                "type": "fp16"
+              }
+            },
+            "ef_construction": 256,
+            "m": 8
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+During ingestion, make sure each vector value is in the supported range ([-65504.0, 65504.0]):
+
+```json
+PUT test-index/_doc/1
+{
+  "my_vector1": [-65504.0, 65503.845, 55.82]
+}
+```
+{% include copy-curl.html %}
+
+During querying, there is no range limitation for the query vector:
+
+```json
+GET test-index/_search
+{
+  "size": 2,
+  "query": {
+    "knn": {
+      "my_vector1": {
+        "vector": [265436.876, -120906.256, 99.84],
+        "k": 2
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Memory estimation
+
+In the best-case scenario, 16-bit vectors produced by the Faiss SQfp16 quantizer require 50% of the memory that 32-bit vectors require.
+
+### HNSW memory estimation
+
+The memory required for HNSW is estimated to be `1.1 * (2 * dimension + 8 * M)` bytes/vector.
+
+As an example, assume that you have 1 million vectors with a dimension of 256 and M of 16. The memory requirement can be estimated as follows:
+
+```bash
+1.1 * (2 * 256 + 8 * 16) * 1,000,000 ~= 0.656 GB
+```
+
+### IVF memory estimation
+
+The memory required for IVF is estimated to be `1.1 * (((2 * dimension) * num_vectors) + (4 * nlist * dimension))` bytes.
+
+As an example, assume that you have 1 million vectors with a dimension of 256 and `nlist` of 128. The memory requirement can be estimated as follows:
+
+```bash
+1.1 * (((2 * 256) * 1,000,000) + (4 * 128 * 256)) ~= 0.525 GB
+```
+
diff --git a/_search-plugins/knn/settings.md b/_search-plugins/knn/settings.md
index 1f43654fbe..f4ef057cfb 100644
--- a/_search-plugins/knn/settings.md
+++ b/_search-plugins/knn/settings.md
@@ -25,3 +25,4 @@ Setting | Default | Description
 `knn.model.index.number_of_shards`| 1 | The number of shards to use for the model system index, the OpenSearch index that stores the models used for Approximate Nearest Neighbor (ANN) search.
 `knn.model.index.number_of_replicas`| 1 | The number of replica shards to use for the model system index. Generally, in a multi-node cluster, this should be at least 1 to increase stability.
 `knn.advanced.filtered_exact_search_threshold`| null | The threshold value for the filtered IDs that is used to switch to exact search during filtered ANN search. If the number of filtered IDs in a segment is less than this setting's value, exact search will be performed on the filtered IDs.
+`knn.faiss.avx2.disabled` | False | A static setting that specifies whether to disable the SIMD-based `libopensearchknn_faiss_avx2.so` library and load the non-optimized `libopensearchknn_faiss.so` library for the Faiss engine on machines with x64 architecture. For more information, see [SIMD optimization for the Faiss engine]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#simd-optimization-for-the-faiss-engine).
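The `knn.faiss.avx2.disabled` setting documented in the table above is a static setting, so it belongs in `opensearch.yml` and takes effect only after the node is restarted. The following minimal sketch assumes a typical x64 Linux node; the file location and any surrounding configuration are assumptions, not part of this patch:

```yaml
# opensearch.yml on an x64 node
# Static setting: stop the node, add the line, and restart for it to take effect.
# When true, the k-NN plugin loads the non-optimized libopensearchknn_faiss.so
# library instead of the AVX2-optimized libopensearchknn_faiss_avx2.so library.
knn.faiss.avx2.disabled: true
```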