diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 21b6fbfea62..fd4213b7e55 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -2,7 +2,7 @@ _Describe what this change achieves._ ### Issues Resolved -_List any issues this PR will resolve, e.g. Closes [...]._ +Closes #[_insert issue number_] ### Version _List the OpenSearch version to which this PR applies, e.g. 2.14, 2.12--2.14, or all._ diff --git a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt index aa04726e42c..c6d129c2c5f 100644 --- a/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt +++ b/.github/vale/styles/Vocab/OpenSearch/Words/accept.txt @@ -77,8 +77,9 @@ Levenshtein [Mm]ultivalued [Mm]ultiword [Nn]amespace -[Oo]versamples? +[Oo]ffline [Oo]nboarding +[Oo]versamples? pebibyte p\d{2} [Pp]erformant @@ -102,8 +103,10 @@ p\d{2} [Rr]eenable [Rr]eindex [Rr]eingest +[Rr]eprovision(ed|ing)? [Rr]erank(er|ed|ing)? [Rr]epo +[Rr]escor(e|ed|ing)? [Rr]ewriter [Rr]ollout [Rr]ollup @@ -127,6 +130,7 @@ stdout [Ss]ubvector [Ss]ubwords? [Ss]uperset +[Ss]uperadmins? [Ss]yslog tebibyte [Tt]emplated diff --git a/.gitignore b/.gitignore index da3cf9d1448..92f01c5fca9 100644 --- a/.gitignore +++ b/.gitignore @@ -7,3 +7,4 @@ Gemfile.lock *.iml .jekyll-cache .project +vendor/bundle diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7afa9d7596e..8ab3c2bd4f5 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -78,6 +78,8 @@ Follow these steps to set up your local copy of the repository: 1. Navigate to your cloned repository. +##### Building using locally installed packages + 1. Install [Ruby](https://www.ruby-lang.org/en/) if you don't already have it. We recommend [RVM](https://rvm.io/), but you can use any method you prefer: ``` @@ -98,6 +100,14 @@ Follow these steps to set up your local copy of the repository: bundle install ``` +##### Building using containerization + +Assuming you have `docker-compose` installed, run the following command: + + ``` + docker compose -f docker-compose.dev.yml up + ``` + #### Troubleshooting Try the following troubleshooting steps if you encounter an error when trying to build the documentation website: diff --git a/_about/index.md b/_about/index.md index d2cc011b557..041197eeba9 100644 --- a/_about/index.md +++ b/_about/index.md @@ -22,16 +22,21 @@ This section contains documentation for OpenSearch and OpenSearch Dashboards. ## Getting started -- [Intro to OpenSearch]({{site.url}}{{site.baseurl}}/intro/) -- [Quickstart]({{site.url}}{{site.baseurl}}/quickstart/) +To get started, explore the following documentation: + +- [Getting started guide]({{site.url}}{{site.baseurl}}/getting-started/): + - [Intro to OpenSearch]({{site.url}}{{site.baseurl}}/getting-started/intro/) + - [Installation quickstart]({{site.url}}{{site.baseurl}}/getting-started/quickstart/) + - [Communicate with OpenSearch]({{site.url}}{{site.baseurl}}/getting-started/communicate/) + - [Ingest data]({{site.url}}{{site.baseurl}}/getting-started/ingest-data/) + - [Search data]({{site.url}}{{site.baseurl}}/getting-started/search-data/) + - [Getting started with OpenSearch security]({{site.url}}{{site.baseurl}}/getting-started/security/) - [Install OpenSearch]({{site.url}}{{site.baseurl}}/install-and-configure/install-opensearch/index/) - [Install OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/install-and-configure/install-dashboards/index/) -- [See the FAQ](https://opensearch.org/faq) +- [FAQ](https://opensearch.org/faq) ## Why use OpenSearch? -With OpenSearch, you can perform the following use cases: - @@ -41,35 +46,38 @@ With OpenSearch, you can perform the following use cases: - - - - + + + + - + - +
Operational health tracking
Fast, Scalable Full-text SearchApplication and Infrastructure MonitoringSecurity and Event Information ManagementOperational Health TrackingFast, scalable full-text searchApplication and infrastructure monitoringSecurity and event information managementOperational health tracking
Help users find the right information within your application, website, or data lake catalog. Easily store and analyze log data, and set automated alerts for underperformance.Easily store and analyze log data, and set automated alerts for performance issues. Centralize logs to enable real-time security monitoring and forensic analysis.Use observability logs, metrics, and traces to monitor your applications and business in real time.Use observability logs, metrics, and traces to monitor your applications in real time.
-**Additional features and plugins:** +## Key features + +OpenSearch provides several features to help index, secure, monitor, and analyze your data: -OpenSearch has several features and plugins to help index, secure, monitor, and analyze your data. Most OpenSearch plugins have corresponding OpenSearch Dashboards plugins that provide a convenient, unified user interface. -- [Anomaly detection]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/) - Identify atypical data and receive automatic notifications -- [KNN]({{site.url}}{{site.baseurl}}/search-plugins/knn/) - Find “nearest neighbors” in your vector data -- [Performance Analyzer]({{site.url}}{{site.baseurl}}/monitoring-plugins/pa/) - Monitor and optimize your cluster -- [SQL]({{site.url}}{{site.baseurl}}/search-plugins/sql/index/) - Use SQL or a piped processing language to query your data -- [Index State Management]({{site.url}}{{site.baseurl}}/im-plugin/) - Automate index operations -- [ML Commons plugin]({{site.url}}{{site.baseurl}}/ml-commons-plugin/index/) - Train and execute machine-learning models -- [Asynchronous search]({{site.url}}{{site.baseurl}}/search-plugins/async/) - Run search requests in the background -- [Cross-cluster replication]({{site.url}}{{site.baseurl}}/replication-plugin/index/) - Replicate your data across multiple OpenSearch clusters +- [Anomaly detection]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/) -- Identify atypical data and receive automatic notifications. +- [SQL]({{site.url}}{{site.baseurl}}/search-plugins/sql/index/) -- Use SQL or a Piped Processing Language (PPL) to query your data. +- [Index State Management]({{site.url}}{{site.baseurl}}/im-plugin/) -- Automate index operations. +- [Search methods]({{site.url}}{{site.baseurl}}/search-plugins/knn/) -- From traditional lexical search to advanced vector and hybrid search, discover the optimal search method for your use case. +- [Machine learning]({{site.url}}{{site.baseurl}}/ml-commons-plugin/index/) -- Integrate machine learning models into your workloads. +- [Workflow automation]({{site.url}}{{site.baseurl}}/automating-configurations/index/) -- Automate complex OpenSearch setup and preprocessing tasks. +- [Performance evaluation]({{site.url}}{{site.baseurl}}/monitoring-plugins/pa/) -- Monitor and optimize your cluster. +- [Asynchronous search]({{site.url}}{{site.baseurl}}/search-plugins/async/) -- Run search requests in the background. +- [Cross-cluster replication]({{site.url}}{{site.baseurl}}/replication-plugin/index/) -- Replicate your data across multiple OpenSearch clusters. ## The secure path forward -OpenSearch includes a demo configuration so that you can get up and running quickly, but before using OpenSearch in a production environment, you must [configure the Security plugin manually]({{site.url}}{{site.baseurl}}/security/configuration/index/) with your own certificates, authentication method, users, and passwords. + +OpenSearch includes a demo configuration so that you can get up and running quickly, but before using OpenSearch in a production environment, you must [configure the Security plugin manually]({{site.url}}{{site.baseurl}}/security/configuration/index/) with your own certificates, authentication method, users, and passwords. To get started, see [Getting started with OpenSearch security]({{site.url}}{{site.baseurl}}/getting-started/security/). ## Looking for the Javadoc? diff --git a/_about/version-history.md b/_about/version-history.md index fd635aff5b3..47253558e94 100644 --- a/_about/version-history.md +++ b/_about/version-history.md @@ -9,6 +9,7 @@ permalink: /version-history/ OpenSearch version | Release highlights | Release date :--- | :--- | :--- +[2.17.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.17.0.md) | Includes disk-optimized vector search, binary quantization, and byte vector encoding in k-NN. Adds asynchronous batch ingestion for ML tasks. Provides search and query performance enhancements and a new custom trace source in trace analytics. Includes application-based configuration templates. For a full list of release highlights, see the Release Notes. | 17 September 2024 [2.16.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.16.0.md) | Includes built-in byte vector quantization and binary vector support in k-NN. Adds new sort, split, and ML inference search processors for search pipelines. Provides application-based configuration templates and additional plugins to integrate multiple data sources in OpenSearch Dashboards. Includes an experimental Batch Predict ML Commons API. For a full list of release highlights, see the Release Notes. | 06 August 2024 [2.15.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.15.0.md) | Includes parallel ingestion processing, SIMD support for exact search, and the ability to disable doc values for the k-NN field. Adds wildcard and derived field types. Improves performance for single-cardinality aggregations, rolling upgrades to remote-backed clusters, and more metrics for top N queries. For a full list of release highlights, see the Release Notes. | 25 June 2024 [2.14.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.14.0.md) | Includes performance improvements to hybrid search and date histogram queries with multi-range traversal, ML model integration within the Ingest API, semantic cache for LangChain applications, low-level vector query interface for neural sparse queries, and improved k-NN search filtering. Provides an experimental tiered cache feature. For a full list of release highlights, see the Release Notes. | 14 May 2024 diff --git a/_analyzers/index.md b/_analyzers/index.md index 95f97ec8ce8..9b999e5c3d8 100644 --- a/_analyzers/index.md +++ b/_analyzers/index.md @@ -170,6 +170,28 @@ The response provides information about the analyzers for each field: } ``` +## Normalizers +Tokenization divides text into individual terms, but it does not address variations in token forms. Normalization resolves these issues by converting tokens into a standard format. This ensures that similar terms are matched appropriately, even if they are not identical. + +### Normalization techniques + +The following normalization techniques can help address variations in token forms: +1. **Case normalization**: Converts all tokens to lowercase to ensure case-insensitive matching. For example, "Hello" is normalized to "hello". + +2. **Stemming**: Reduces words to their root form. For instance, "cars" is stemmed to "car", and "running" is normalized to "run". + +3. **Synonym handling:** Treats synonyms as equivalent. For example, "jogging" and "running" can be indexed under a common term, such as "run". + +### Normalization + +A search for `Hello` will match documents containing `hello` because of case normalization. + +A search for `cars` will also match documents containing `car` because of stemming. + +A query for `running` can retrieve documents containing `jogging` using synonym handling. + +Normalization ensures that searches are not limited to exact term matches, allowing for more relevant results. For instance, a search for `Cars running` can be normalized to match `car run`. + ## Next steps -- Learn more about specifying [index analyzers]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) and [search analyzers]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/). \ No newline at end of file +- Learn more about specifying [index analyzers]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) and [search analyzers]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/). diff --git a/_analyzers/token-filters/asciifolding.md b/_analyzers/token-filters/asciifolding.md new file mode 100644 index 00000000000..d5722519889 --- /dev/null +++ b/_analyzers/token-filters/asciifolding.md @@ -0,0 +1,135 @@ +--- +layout: default +title: ASCII folding +parent: Token filters +nav_order: 20 +--- + +# ASCII folding token filter + +The `asciifolding` token filter converts non-ASCII characters to their closest ASCII equivalents. For example, *é* becomes *e*, *ü* becomes *u*, and *ñ* becomes *n*. This process is known as *transliteration*. + + +The `asciifolding` token filter offers a number of benefits: + + - **Enhanced search flexibility**: Users often omit accents or special characters when entering queries. The `asciifolding` token filter ensures that such queries still return relevant results. + - **Normalization**: Standardizes the indexing process by ensuring that accented characters are consistently converted to their ASCII equivalents. + - **Internationalization**: Particularly useful for applications including multiple languages and character sets. + +While the `asciifolding` token filter can simplify searches, it may also lead to the loss of specific information, particularly if the distinction between accented and non-accented characters in the dataset is significant. +{: .warning} + +## Parameters + +You can configure the `asciifolding` token filter using the `preserve_original` parameter. Setting this parameter to `true` keeps both the original token and its ASCII-folded version in the token stream. This can be particularly useful when you want to match both the original (with accents) and the normalized (without accents) versions of a term in a search query. Default is `false`. + +## Example + +The following example request creates a new index named `example_index` and defines an analyzer with the `asciifolding` filter and `preserve_original` parameter set to `true`: + +```json +PUT /example_index +{ + "settings": { + "analysis": { + "filter": { + "custom_ascii_folding": { + "type": "asciifolding", + "preserve_original": true + } + }, + "analyzer": { + "custom_ascii_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "custom_ascii_folding" + ] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /example_index/_analyze +{ + "analyzer": "custom_ascii_analyzer", + "text": "Résumé café naïve coördinate" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "resume", + "start_offset": 0, + "end_offset": 6, + "type": "", + "position": 0 + }, + { + "token": "résumé", + "start_offset": 0, + "end_offset": 6, + "type": "", + "position": 0 + }, + { + "token": "cafe", + "start_offset": 7, + "end_offset": 11, + "type": "", + "position": 1 + }, + { + "token": "café", + "start_offset": 7, + "end_offset": 11, + "type": "", + "position": 1 + }, + { + "token": "naive", + "start_offset": 12, + "end_offset": 17, + "type": "", + "position": 2 + }, + { + "token": "naïve", + "start_offset": 12, + "end_offset": 17, + "type": "", + "position": 2 + }, + { + "token": "coordinate", + "start_offset": 18, + "end_offset": 28, + "type": "", + "position": 3 + }, + { + "token": "coördinate", + "start_offset": 18, + "end_offset": 28, + "type": "", + "position": 3 + } + ] +} +``` + + diff --git a/_analyzers/token-filters/cjk-bigram.md b/_analyzers/token-filters/cjk-bigram.md new file mode 100644 index 00000000000..ab21549c47d --- /dev/null +++ b/_analyzers/token-filters/cjk-bigram.md @@ -0,0 +1,160 @@ +--- +layout: default +title: CJK bigram +parent: Token filters +nav_order: 30 +--- + +# CJK bigram token filter + +The `cjk_bigram` token filter is designed specifically for processing East Asian languages, such as Chinese, Japanese, and Korean (CJK), which typically don't use spaces to separate words. A bigram is a sequence of two adjacent elements in a string of tokens, which can be characters or words. For CJK languages, bigrams help approximate word boundaries and capture significant character pairs that can convey meaning. + + +## Parameters + +The `cjk_bigram` token filter can be configured with two parameters: `ignore_scripts`and `output_unigrams`. + +### `ignore_scripts` + +The `cjk-bigram` token filter ignores all non-CJK scripts (writing systems like Latin or Cyrillic) and tokenizes only CJK text into bigrams. Use this option to specify CJK scripts to be ignored. This option takes the following valid values: + +- `han`: The `han` script processes Han characters. [Han characters](https://simple.wikipedia.org/wiki/Chinese_characters) are logograms used in the written languages of China, Japan, and Korea. The filter can help with text processing tasks like tokenizing, normalizing, or stemming text written in Chinese, Japanese kanji, or Korean Hanja. + +- `hangul`: The `hangul` script processes Hangul characters, which are unique to the Korean language and do not exist in other East Asian scripts. + +- `hiragana`: The `hiragana` script processes hiragana, one of the two syllabaries used in the Japanese writing system. + Hiragana is typically used for native Japanese words, grammatical elements, and certain forms of punctuation. + +- `katakana`: The `katakana` script processes katakana, the other Japanese syllabary. + Katakana is mainly used for foreign loanwords, onomatopoeia, scientific names, and certain Japanese words. + + +### `output_unigrams` + +This option, when set to `true`, outputs both unigrams (single characters) and bigrams. Default is `false`. + +## Example + +The following example request creates a new index named `devanagari_example_index` and defines an analyzer with the `cjk_bigram_filter` filter and `ignored_scripts` parameter set to `katakana`: + +```json +PUT /cjk_bigram_example +{ + "settings": { + "analysis": { + "analyzer": { + "cjk_bigrams_no_katakana": { + "tokenizer": "standard", + "filter": [ "cjk_bigrams_no_katakana_filter" ] + } + }, + "filter": { + "cjk_bigrams_no_katakana_filter": { + "type": "cjk_bigram", + "ignored_scripts": [ + "katakana" + ], + "output_unigrams": true + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /cjk_bigram_example/_analyze +{ + "analyzer": "cjk_bigrams_no_katakana", + "text": "東京タワーに行く" +} +``` +{% include copy-curl.html %} + +Sample text: "東京タワーに行く" + + 東京 (Kanji for "Tokyo") + タワー (Katakana for "Tower") + に行く (Hiragana and Kanji for "go to") + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "東", + "start_offset": 0, + "end_offset": 1, + "type": "", + "position": 0 + }, + { + "token": "東京", + "start_offset": 0, + "end_offset": 2, + "type": "", + "position": 0, + "positionLength": 2 + }, + { + "token": "京", + "start_offset": 1, + "end_offset": 2, + "type": "", + "position": 1 + }, + { + "token": "タワー", + "start_offset": 2, + "end_offset": 5, + "type": "", + "position": 2 + }, + { + "token": "に", + "start_offset": 5, + "end_offset": 6, + "type": "", + "position": 3 + }, + { + "token": "に行", + "start_offset": 5, + "end_offset": 7, + "type": "", + "position": 3, + "positionLength": 2 + }, + { + "token": "行", + "start_offset": 6, + "end_offset": 7, + "type": "", + "position": 4 + }, + { + "token": "行く", + "start_offset": 6, + "end_offset": 8, + "type": "", + "position": 4, + "positionLength": 2 + }, + { + "token": "く", + "start_offset": 7, + "end_offset": 8, + "type": "", + "position": 5 + } + ] +} +``` + + diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md new file mode 100644 index 00000000000..4960729cd1f --- /dev/null +++ b/_analyzers/token-filters/cjk-width.md @@ -0,0 +1,96 @@ +--- +layout: default +title: CJK width +parent: Token filters +nav_order: 40 +--- + +# CJK width token filter + +The `cjk_width` token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents. + +### Converting full-width ASCII characters + +In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, occupying the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography for alignment with the width of CJK characters. However, for the purposes of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. + +The following example illustrates ASCII character normalization: + +``` + Full-Width: ABCDE 12345 + Normalized (half-width): ABCDE 12345 +``` + +### Converting half-width katakana characters + +The `cjk_width` token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization, illustrated in the following example, is important for consistency in text processing and searching: + + +``` + Half-Width katakana: カタカナ + Normalized (full-width) katakana: カタカナ +``` + +## Example + +The following example request creates a new index named `cjk_width_example_index` and defines an analyzer with the `cjk_width` filter: + +```json +PUT /cjk_width_example_index +{ + "settings": { + "analysis": { + "analyzer": { + "cjk_width_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": ["cjk_width"] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +POST /cjk_width_example_index/_analyze +{ + "analyzer": "cjk_width_analyzer", + "text": "Tokyo 2024 カタカナ" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "Tokyo", + "start_offset": 0, + "end_offset": 5, + "type": "", + "position": 0 + }, + { + "token": "2024", + "start_offset": 6, + "end_offset": 10, + "type": "", + "position": 1 + }, + { + "token": "カタカナ", + "start_offset": 11, + "end_offset": 15, + "type": "", + "position": 2 + } + ] +} +``` diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index f4e9c434e74..86925123b83 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -14,9 +14,9 @@ The following table lists all token filters that OpenSearch supports. Token filter | Underlying Lucene token filter| Description [`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it. -`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. +[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. `cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. -`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. +[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into their equivalent basic Latin characters.
- Folds half-width katakana character variants into their equivalent kana characters. `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. `common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. `conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. diff --git a/_api-reference/cat/cat-shards.md b/_api-reference/cat/cat-shards.md index b07f11aca3f..56817936a63 100644 --- a/_api-reference/cat/cat-shards.md +++ b/_api-reference/cat/cat-shards.md @@ -33,6 +33,7 @@ Parameter | Type | Description bytes | Byte size | Specify the units for byte size. For example, `7kb` or `6gb`. For more information, see [Supported units]({{site.url}}{{site.baseurl}}/opensearch/units/). local | Boolean | Whether to return information from the local node only instead of from the cluster manager node. Default is `false`. cluster_manager_timeout | Time | The amount of time to wait for a connection to the cluster manager node. Default is 30 seconds. +cancel_after_time_interval | Time | The amount of time after which the shard request will be canceled. Default is `-1`. time | Time | Specify the units for time. For example, `5d` or `7h`. For more information, see [Supported units]({{site.url}}{{site.baseurl}}/opensearch/units/). ## Example requests diff --git a/_api-reference/document-apis/bulk-streaming.md b/_api-reference/document-apis/bulk-streaming.md new file mode 100644 index 00000000000..7d05e93c8a8 --- /dev/null +++ b/_api-reference/document-apis/bulk-streaming.md @@ -0,0 +1,81 @@ +--- +layout: default +title: Streaming bulk +parent: Document APIs +nav_order: 25 +redirect_from: + - /opensearch/rest-api/document-apis/bulk/streaming/ +--- + +# Streaming bulk +**Introduced 2.17.0** +{: .label .label-purple } + +This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/9065). +{: .warning} + +The streaming bulk operation lets you add, update, or delete multiple documents by streaming the request and getting the results as a streaming response. In comparison to the traditional [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/), streaming ingestion eliminates the need to estimate the batch size (which is affected by the cluster operational state at any given time) and naturally applies backpressure between many clients and the cluster. The streaming works over HTTP/2 or HTTP/1.1 (using chunked transfer encoding), depending on the capabilities of the clients and the cluster. + +The default HTTP transport method does not support streaming. You must install the [`transport-reactor-netty4`]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/network-settings/#selecting-the-transport) HTTP transport plugin and use it as the default HTTP transport layer. Both the `transport-reactor-netty4` plugin and the Streaming Bulk API are experimental. +{: .note} + +## Path and HTTP methods + +```json +POST _bulk/stream +POST /_bulk/stream +``` + +If you specify the index in the path, then you don't need to include it in the [request body chunks]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/#request-body). + +OpenSearch also accepts PUT requests to the `_bulk/stream` path, but we highly recommend using POST. The accepted usage of PUT---adding or replacing a single resource on a given path---doesn't make sense for streaming bulk requests. +{: .note } + + +## Query parameters + +The following table lists the available query parameters. All query parameters are optional. + +Parameter | Data type | Description +:--- | :--- | :--- +`pipeline` | String | The pipeline ID for preprocessing documents. +`refresh` | Enum | Whether to refresh the affected shards after performing the indexing operations. Default is `false`. `true` causes the changes show up in search results immediately but degrades cluster performance. `wait_for` waits for a refresh. Requests take longer to return, but cluster performance isn't degraded. +`require_alias` | Boolean | Set to `true` to require that all actions target an index alias rather than an index. Default is `false`. +`routing` | String | Routes the request to the specified shard. +`timeout` | Time | How long to wait for the request to return. Default is `1m`. +`type` | String | (Deprecated) The default document type for documents that don't specify a type. Default is `_doc`. We highly recommend ignoring this parameter and using the `_doc` type for all indexes. +`wait_for_active_shards` | String | Specifies the number of active shards that must be available before OpenSearch processes the bulk request. Default is `1` (only the primary shard). Set to `all` or a positive integer. Values greater than 1 require replicas. For example, if you specify a value of 3, the index must have 2 replicas distributed across 2 additional nodes in order for the request to succeed. +`batch_interval` | Time | Specifies for how long bulk operations should be accumulated into a batch before sending the batch to data nodes. +`batch_size` | Time | Specifies how many bulk operations should be accumulated into a batch before sending the batch to data nodes. Default is `1`. +{% comment %}_source | List | asdf +`_source_excludes` | List | asdf +`_source_includes` | List | asdf{% endcomment %} + +## Request body + +The Streaming Bulk API request body is fully compatible with the [Bulk API request body]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/#request-body), where each bulk operation (create/index/update/delete) is sent as a separate chunk. + +## Example request + +```json +curl -X POST "http://localhost:9200/_bulk/stream" -H "Transfer-Encoding: chunked" -H "Content-Type: application/json" -d' +{ "delete": { "_index": "movies", "_id": "tt2229499" } } +{ "index": { "_index": "movies", "_id": "tt1979320" } } +{ "title": "Rush", "year": 2013 } +{ "create": { "_index": "movies", "_id": "tt1392214" } } +{ "title": "Prisoners", "year": 2013 } +{ "update": { "_index": "movies", "_id": "tt0816711" } } +{ "doc" : { "title": "World War Z" } } +' +``` +{% include copy.html %} + +## Example response + +Depending on the batch settings, each streamed response chunk may report the results of one or many (batch) bulk operations. For example, for the preceding request with no batching (default), the streaming response may appear as follows: + +```json +{"took": 11, "errors": false, "items": [ { "index": {"_index": "movies", "_id": "tt1979320", "_version": 1, "result": "created", "_shards": { "total": 2 "successful": 1, "failed": 0 }, "_seq_no": 1, "_primary_term": 1, "status": 201 } } ] } +{"took": 2, "errors": true, "items": [ { "create": { "_index": "movies", "_id": "tt1392214", "status": 409, "error": { "type": "version_conflict_engine_exception", "reason": "[tt1392214]: version conflict, document already exists (current version [1])", "index": "movies", "shard": "0", "index_uuid": "yhizhusbSWmP0G7OJnmcLg" } } } ] } +{"took": 4, "errors": true, "items": [ { "update": { "_index": "movies", "_id": "tt0816711", "status": 404, "error": { "type": "document_missing_exception", "reason": "[_doc][tt0816711]: document missing", "index": "movies", "shard": "0", "index_uuid": "yhizhusbSWmP0G7OJnmcLg" } } } ] } +``` diff --git a/_api-reference/document-apis/bulk.md b/_api-reference/document-apis/bulk.md index 0475aa573d5..4add60ee373 100644 --- a/_api-reference/document-apis/bulk.md +++ b/_api-reference/document-apis/bulk.md @@ -53,16 +53,16 @@ All bulk URL parameters are optional. Parameter | Type | Description :--- | :--- | :--- pipeline | String | The pipeline ID for preprocessing documents. -refresh | Enum | Whether to refresh the affected shards after performing the indexing operations. Default is `false`. `true` makes the changes show up in search results immediately, but hurts cluster performance. `wait_for` waits for a refresh. Requests take longer to return, but cluster performance doesn't suffer. +refresh | Enum | Whether to refresh the affected shards after performing the indexing operations. Default is `false`. `true` causes the changes show up in search results immediately but degrades cluster performance. `wait_for` waits for a refresh. Requests take longer to return, but cluster performance isn't degraded. require_alias | Boolean | Set to `true` to require that all actions target an index alias rather than an index. Default is `false`. routing | String | Routes the request to the specified shard. -timeout | Time | How long to wait for the request to return. Default `1m`. -type | String | (Deprecated) The default document type for documents that don't specify a type. Default is `_doc`. We highly recommend ignoring this parameter and using a type of `_doc` for all indexes. -wait_for_active_shards | String | Specifies the number of active shards that must be available before OpenSearch processes the bulk request. Default is 1 (only the primary shard). Set to `all` or a positive integer. Values greater than 1 require replicas. For example, if you specify a value of 3, the index must have two replicas distributed across two additional nodes for the request to succeed. +timeout | Time | How long to wait for the request to return. Default is `1m`. +type | String | (Deprecated) The default document type for documents that don't specify a type. Default is `_doc`. We highly recommend ignoring this parameter and using the `_doc` type for all indexes. +wait_for_active_shards | String | Specifies the number of active shards that must be available before OpenSearch processes the bulk request. Default is `1` (only the primary shard). Set to `all` or a positive integer. Values greater than 1 require replicas. For example, if you specify a value of 3, the index must have 2 replicas distributed across 2 additional nodes in order for the request to succeed. batch_size | Integer | **(Deprecated)** Specifies the number of documents to be batched and sent to an ingest pipeline to be processed together. Default is `2147483647` (documents are ingested by an ingest pipeline all at once). If the bulk request doesn't explicitly specify an ingest pipeline or the index doesn't have a default ingest pipeline, then this parameter is ignored. Only documents with `create`, `index`, or `update` actions can be grouped into batches. {% comment %}_source | List | asdf -_source_excludes | list | asdf -_source_includes | list | asdf{% endcomment %} +_source_excludes | List | asdf +_source_includes | List | asdf{% endcomment %} ## Request body diff --git a/_api-reference/index-apis/create-index-template.md b/_api-reference/index-apis/create-index-template.md index 2a92e3f4c48..ea71126210d 100644 --- a/_api-reference/index-apis/create-index-template.md +++ b/_api-reference/index-apis/create-index-template.md @@ -45,7 +45,7 @@ Parameter | Type | Description `priority` | Integer | A number that determines which index templates take precedence during the creation of a new index or data stream. OpenSearch chooses the template with the highest priority. When no priority is given, the template is assigned a `0`, signifying the lowest priority. Optional. `template` | Object | The template that includes the `aliases`, `mappings`, or `settings` for the index. For more information, see [#template]. Optional. `version` | Integer | The version number used to manage index templates. Version numbers are not automatically set by OpenSearch. Optional. - +`context` | Object | (Experimental) The `context` parameter provides use-case-specific predefined templates that can be applied to an index. Among all settings and mappings declared for a template, context templates hold the highest priority. For more information, see [index-context]({{site.url}}{{site.baseurl}}/im-plugin/index-context/). ### Template diff --git a/_api-reference/index-apis/create-index.md b/_api-reference/index-apis/create-index.md index 2f4c1041bcb..7f7d26815f2 100644 --- a/_api-reference/index-apis/create-index.md +++ b/_api-reference/index-apis/create-index.md @@ -50,7 +50,7 @@ timeout | Time | How long to wait for the request to return. Default is `30s`. ## Request body -As part of your request, you can optionally specify [index settings]({{site.url}}{{site.baseurl}}/im-plugin/index-settings/), [mappings]({{site.url}}{{site.baseurl}}/field-types/index/), and [aliases]({{site.url}}{{site.baseurl}}/opensearch/index-alias/) for your newly created index. +As part of your request, you can optionally specify [index settings]({{site.url}}{{site.baseurl}}/im-plugin/index-settings/), [mappings]({{site.url}}{{site.baseurl}}/field-types/index/), [aliases]({{site.url}}{{site.baseurl}}/opensearch/index-alias/), and [index context]({{site.url}}{{site.baseurl}}/opensearch/index-context/). ## Example request diff --git a/_api-reference/index-apis/update-alias.md b/_api-reference/index-apis/update-alias.md index f32d34025e4..c069703bf38 100644 --- a/_api-reference/index-apis/update-alias.md +++ b/_api-reference/index-apis/update-alias.md @@ -10,7 +10,7 @@ nav_order: 5 **Introduced 1.0** {: .label .label-purple } -The Create or Update Alias API adds a data stream or index to an alias or updates the settings for an existing alias. For more alias API operations, see [Index aliases]({{site.url}}{{site.baseurl}}/opensearch/index-alias/). +The Create or Update Alias API adds one or more indexes to an alias or updates the settings for an existing alias. For more alias API operations, see [Index aliases]({{site.url}}{{site.baseurl}}/opensearch/index-alias/). The Create or Update Alias API is distinct from the [Alias API]({{site.url}}{{site.baseurl}}/opensearch/rest-api/alias/), which supports the addition and removal of aliases and the removal of alias indexes. In contrast, the following API only supports adding or updating an alias without updating the index itself. Each API also uses different request body parameters. {: .note} @@ -35,7 +35,7 @@ PUT /_alias | Parameter | Type | Description | :--- | :--- | :--- -| `target` | String | A comma-delimited list of data streams and indexes. Wildcard expressions (`*`) are supported. To target all data streams and indexes in a cluster, use `_all` or `*`. Optional. | +| `target` | String | A comma-delimited list of indexes. Wildcard expressions (`*`) are supported. To target all indexes in a cluster, use `_all` or `*`. Optional. | | `alias-name` | String | The alias name to be created or updated. Optional. | ## Query parameters @@ -53,7 +53,7 @@ In the request body, you can specify the index name, the alias name, and the set Field | Type | Description :--- | :--- | :--- | :--- -`index` | String | A comma-delimited list of data streams or indexes that you want to associate with the alias. If this field is set, it will override the index name specified in the URL path. +`index` | String | A comma-delimited list of indexes that you want to associate with the alias. If this field is set, it will override the index name specified in the URL path. `alias` | String | The name of the alias. If this field is set, it will override the alias name specified in the URL path. `is_write_index` | Boolean | Specifies whether the index should be a write index. An alias can only have one write index at a time. If a write request is submitted to an alias that links to multiple indexes, then OpenSearch runs the request only on the write index. `routing` | String | Assigns a custom value to a shard for specific operations. diff --git a/_api-reference/snapshots/create-repository.md b/_api-reference/snapshots/create-repository.md index ca4c04114c9..34e2ea83767 100644 --- a/_api-reference/snapshots/create-repository.md +++ b/_api-reference/snapshots/create-repository.md @@ -38,11 +38,20 @@ Request parameters depend on the type of repository: `fs` or `s3`. ### Common parameters -The following table lists parameters that can be used with both the `fs` and `s3` repositories. +The following table lists parameters that can be used with both the `fs` and `s3` repositories. Request field | Description :--- | :--- `prefix_mode_verification` | When enabled, adds a hashed value of a random seed to the prefix for repository verification. For remote-store-enabled clusters, you can add the `setting.prefix_mode_verification` setting to the node attributes for the supplied repository. This field works with both new and existing repositories. Optional. +`shard_path_type` | Controls the path structure of shard-level blobs. Supported values are `FIXED`, `HASHED_PREFIX`, and `HASHED_INFIX`. For more information about each value, see [shard_path_type values](#shard_path_type-values)/. Default is `FIXED`. Optional. + +#### shard_path_type values + +The following values are supported in the `shard_path_type` setting: + +- `FIXED`: Keeps the path structure in the existing hierarchical manner, such as `//indices//0/`. +- `HASHED_PREFIX`: Prepends a hashed prefix at the start of the path for each unique shard ID, for example, `///indices//0/`. +- `HASHED_INFIX`: Appends a hashed prefix after the base path for each unique shard ID, for example, `///indices//0/`. The hash method used is `FNV_1A_COMPOSITE_1`, which uses the `FNV1a` hash function and generates a custom-encoded 64-bit hash value that scales well with most remote store options. `FNV1a` takes the most significant 6 bits to create a URL-safe Base64 character and the next 14 bits to create a binary string. ### fs repository @@ -54,6 +63,7 @@ Request field | Description `max_restore_bytes_per_sec` | The maximum rate at which snapshots restore. Default is 40 MB per second (`40m`). Optional. `max_snapshot_bytes_per_sec` | The maximum rate at which snapshots take. Default is 40 MB per second (`40m`). Optional. `remote_store_index_shallow_copy` | Boolean | Determines whether the snapshot of the remote store indexes are captured as a shallow copy. Default is `false`. +`shallow_snapshot_v2` | Boolean | Determines whether the snapshots of the remote store indexes are captured as a [shallow copy v2]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/snapshot-interoperability/#shallow-snapshot-v2). Default is `false`. `readonly` | Whether the repository is read-only. Useful when migrating from one cluster (`"readonly": false` when registering) to another cluster (`"readonly": true` when registering). Optional. @@ -73,6 +83,7 @@ Request field | Description `max_snapshot_bytes_per_sec` | The maximum rate at which snapshots take. Default is 40 MB per second (`40m`). Optional. `readonly` | Whether the repository is read-only. Useful when migrating from one cluster (`"readonly": false` when registering) to another cluster (`"readonly": true` when registering). Optional. `remote_store_index_shallow_copy` | Boolean | Whether the snapshot of the remote store indexes is captured as a shallow copy. Default is `false`. +`shallow_snapshot_v2` | Boolean | Determines whether the snapshots of the remote store indexes are captured as a [shallow copy v2]([shallow copy v2]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/snapshot-interoperability/#shallow-snapshot-v2). Default is `false`. `server_side_encryption` | Whether to encrypt snapshot files in the S3 bucket. This setting uses AES-256 with S3-managed keys. See [Protecting data using server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html). Default is `false`. Optional. `storage_class` | Specifies the [S3 storage class](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-class-intro.html) for the snapshots files. Default is `standard`. Do not use the `glacier` and `deep_archive` storage classes. Optional. diff --git a/_api-reference/snapshots/create-snapshot.md b/_api-reference/snapshots/create-snapshot.md index d4c9ef82194..b35d1a1d0cd 100644 --- a/_api-reference/snapshots/create-snapshot.md +++ b/_api-reference/snapshots/create-snapshot.md @@ -144,4 +144,5 @@ The snapshot definition is returned. | failures | array | Failures, if any, that occured during snapshot creation. | | shards | object | Total number of shards created along with number of successful and failed shards. | | state | string | Snapshot status. Possible values: `IN_PROGRESS`, `SUCCESS`, `FAILED`, `PARTIAL`. | -| remote_store_index_shallow_copy | Boolean | Whether the snapshot of the remote store indexes is captured as a shallow copy. Default is `false`. | \ No newline at end of file +| remote_store_index_shallow_copy | Boolean | Whether the snapshots of the remote store indexes is captured as a shallow copy. Default is `false`. | +| pinned_timestamp | long | A timestamp (in milliseconds) pinned by the snapshot for the implicit locking of remote store files referenced by the snapshot. | \ No newline at end of file diff --git a/_api-reference/snapshots/get-snapshot-status.md b/_api-reference/snapshots/get-snapshot-status.md index c7f919bcb3d..9636b40d64d 100644 --- a/_api-reference/snapshots/get-snapshot-status.md +++ b/_api-reference/snapshots/get-snapshot-status.md @@ -22,8 +22,9 @@ Path parameters are optional. | Parameter | Data type | Description | :--- | :--- | :--- -| repository | String | Repository containing the snapshot. | -| snapshot | String | Snapshot to return. | +| repository | String | The repository containing the snapshot. | +| snapshot | List | The snapshot(s) to return. | +| index | List | The indexes to include in the response. | Three request variants provide flexibility: @@ -31,16 +32,23 @@ Three request variants provide flexibility: * `GET _snapshot//_status` returns all currently running snapshots in the specified repository. This is the preferred variant. -* `GET _snapshot///_status` returns detailed status information for a specific snapshot in the specified repository, regardless of whether it's currently running or not. +* `GET _snapshot///_status` returns detailed status information for a specific snapshot(s) in the specified repository, regardless of whether it's currently running. -Using the API to return state for other than currently running snapshots can be very costly for (1) machine machine resources and (2) processing time if running in the cloud. For each snapshot, each request causes file reads from all a snapshot's shards. +* `GET /_snapshot////_status` returns detailed status information only for the specified indexes in a specific snapshot in the specified repository. Note that this endpoint works only for indexes belonging to a specific snapshot. + +Snapshot API calls only work if the total number of shards across the requested resources, such as snapshots and indexes created from snapshots, is smaller than the limit specified by the following cluster setting: + +- `snapshot.max_shards_allowed_in_status_api`(Dynamic, integer): The maximum number of shards that can be included in the Snapshot Status API response. Default value is `200000`. Not applicable for [shallow snapshots v2]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/snapshot-interoperability##shallow-snapshot-v2), where the total number and sizes of files are returned as 0. + + +Using the API to return the state of snapshots that are not currently running can be very costly in terms of both machine resources and processing time when querying data in the cloud. For each snapshot, each request causes a file read of all of the snapshot's shards. {: .warning} ## Request fields | Field | Data type | Description | :--- | :--- | :--- -| ignore_unavailable | Boolean | How to handles requests for unavailable snapshots. If `false`, the request returns an error for unavailable snapshots. If `true`, the request ignores unavailable snapshots, such as those that are corrupted or temporarily cannot be returned. Defaults to `false`.| +| ignore_unavailable | Boolean | How to handle requests for unavailable snapshots and indexes. If `false`, the request returns an error for unavailable snapshots and indexes. If `true`, the request ignores unavailable snapshots and indexes, such as those that are corrupted or temporarily cannot be returned. Default is `false`.| ## Example request @@ -375,18 +383,18 @@ The `GET _snapshot/my-opensearch-repo/my-first-snapshot/_status` request returns :--- | :--- | :--- | repository | String | Name of repository that contains the snapshot. | | snapshot | String | Snapshot name. | -| uuid | String | Snapshot Universally unique identifier (UUID). | +| uuid | String | A snapshot's universally unique identifier (UUID). | | state | String | Snapshot's current status. See [Snapshot states](#snapshot-states). | | include_global_state | Boolean | Whether the current cluster state is included in the snapshot. | | shards_stats | Object | Snapshot's shard counts. See [Shard stats](#shard-stats). | -| stats | Object | Details of files included in the snapshot. `file_count`: number of files. `size_in_bytes`: total of all fie sizes. See [Snapshot file stats](#snapshot-file-stats). | +| stats | Object | Information about files included in the snapshot. `file_count`: number of files. `size_in_bytes`: total size of all files. See [Snapshot file stats](#snapshot-file-stats). | | index | list of Objects | List of objects that contain information about the indices in the snapshot. See [Index objects](#index-objects).| ##### Snapshot states | State | Description | :--- | :--- | -| FAILED | The snapshot terminated in an error and no data was stored. | +| FAILED | The snapshot terminated in an error and no data was stored. | | IN_PROGRESS | The snapshot is currently running. | | PARTIAL | The global cluster state was stored, but data from at least one shard was not stored. The `failures` property of the [Create snapshot]({{site.url}}{{site.baseurl}}/api-reference/snapshots/create-snapshot) response contains additional details. | | SUCCESS | The snapshot finished and all shards were stored successfully. | @@ -420,4 +428,4 @@ All property values are Integers. :--- | :--- | :--- | | shards_stats | Object | See [Shard stats](#shard-stats). | | stats | Object | See [Snapshot file stats](#snapshot-file-stats). | -| shards | list of Objects | List of objects containing information about the shards that include the snapshot. OpenSearch returns the following properties about the shards.

**stage**: Current state of shards in the snapshot. Shard states are:

* DONE: Number of shards in the snapshot that were successfully stored in the repository.

* FAILURE: Number of shards in the snapshot that were not successfully stored in the repository.

* FINALIZE: Number of shards in the snapshot that are in the finalizing stage of being stored in the repository.

* INIT: Number of shards in the snapshot that are in the initializing stage of being stored in the repository.

* STARTED: Number of shards in the snapshot that are in the started stage of being stored in the repository.

**stats**: See [Snapshot file stats](#snapshot-file-stats).

**total**: Total number and size of files referenced by the snapshot.

**start_time_in_millis**: Time (in milliseconds) when snapshot creation began.

**time_in_millis**: Total time (in milliseconds) that the snapshot took to complete. | +| shards | List of objects | Contains information about the shards included in the snapshot. OpenSearch returns the following properties about the shard:

**stage**: The current state of shards in the snapshot. Shard states are:

* DONE: The number of shards in the snapshot that were successfully stored in the repository.

* FAILURE: The number of shards in the snapshot that were not successfully stored in the repository.

* FINALIZE: The number of shards in the snapshot that are in the finalizing stage of being stored in the repository.

* INIT: The number of shards in the snapshot that are in the initializing stage of being stored in the repository.

* STARTED: The number of shards in the snapshot that are in the started stage of being stored in the repository.

**stats**: See [Snapshot file stats](#snapshot-file-stats).

**total**: The total number and sizes of files referenced by the snapshot.

**start_time_in_millis**: The time (in milliseconds) when snapshot creation began.

**time_in_millis**: The total amount of time (in milliseconds) that the snapshot took to complete. | diff --git a/_automating-configurations/api/create-workflow.md b/_automating-configurations/api/create-workflow.md index 83c0110ac3f..770c1a1a135 100644 --- a/_automating-configurations/api/create-workflow.md +++ b/_automating-configurations/api/create-workflow.md @@ -58,7 +58,7 @@ POST /_plugins/_flow_framework/workflow?validation=none ``` {% include copy-curl.html %} -You cannot update a full workflow once it has been provisioned, but you can update fields other than the `workflows` field, such as `name` and `description`: +In a workflow that has not been provisioned, you can update fields other than the `workflows` field. For example, you can update the `name` and `description` fields as follows: ```json PUT /_plugins/_flow_framework/workflow/?update_fields=true @@ -72,13 +72,37 @@ PUT /_plugins/_flow_framework/workflow/?update_fields=true You cannot specify both the `provision` and `update_fields` parameters at the same time. {: .note} +If a workflow has been provisioned, you can update and reprovision the full template: + +```json +PUT /_plugins/_flow_framework/workflow/?reprovision=true +{ + +} +``` + +You can add new steps to the workflow but cannot delete them. Only index setting, search pipeline, and ingest pipeline steps can currently be updated. +{: .note} + +You can create and provision a workflow using a [workflow template]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-templates/) as follows: + +```json +POST /_plugins/_flow_framework/workflow?use_case=&provision=true +{ + "create_connector.credential.key" : "" +} +``` +{% include copy-curl.html %} + The following table lists the available query parameters. All query parameters are optional. User-provided parameters are only allowed if the `provision` parameter is set to `true`. | Parameter | Data type | Description | | :--- | :--- | :--- | | `provision` | Boolean | Whether to provision the workflow as part of the request. Default is `false`. | | `update_fields` | Boolean | Whether to update only the fields included in the request body. Default is `false`. | +| `reprovision` | Boolean | Whether to reprovision the entire template if it has already been provisioned. A complete template must be provided in the request body. Default is `false`. | | `validation` | String | Whether to validate the workflow. Valid values are `all` (validate the template) and `none` (do not validate the template). Default is `all`. | +| `use_case` | String | The name of the [workflow template]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-templates/#supported-workflow-templates) to use when creating the workflow. | | User-provided substitution expressions | String | Parameters matching substitution expressions in the template. Only allowed if `provision` is set to `true`. Optional. If `provision` is set to `false`, you can pass these parameters in the [Provision Workflow API query parameters]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/#query-parameters). | ## Request fields @@ -89,7 +113,7 @@ The following table lists the available request fields. |:--- |:--- |:--- |:--- | |`name` |String |Required |The name of the workflow. | |`description` |String |Optional |A description of the workflow. | -|`use_case` |String |Optional | A use case, which can be used with the Search Workflow API to find related workflows. In the future, OpenSearch may provide some standard use cases to ease categorization, but currently you can use this field to specify custom values. | +|`use_case` |String |Optional | A user-provided use case, which can be used with the [Search Workflow API]({{site.url}}{{site.baseurl}}/automating-configurations/api/search-workflow/) to find related workflows. You can use this field to specify custom values. This is distinct from the `use_case` query parameter. | |`version` |Object |Optional | A key-value map with two fields: `template`, which identifies the template version, and `compatibility`, which identifies a list of minimum required OpenSearch versions. | |`workflows` |Object |Optional |A map of workflows. Presently, only the `provision` key is supported. The value for the workflow key is a key-value map that includes fields for `user_params` and lists of `nodes` and `edges`. | diff --git a/_automating-configurations/api/index.md b/_automating-configurations/api/index.md index 716e19c41f8..78bc4eaede5 100644 --- a/_automating-configurations/api/index.md +++ b/_automating-configurations/api/index.md @@ -18,4 +18,6 @@ OpenSearch supports the following workflow APIs: * [Search workflow]({{site.url}}{{site.baseurl}}/automating-configurations/api/search-workflow/) * [Search workflow state]({{site.url}}{{site.baseurl}}/automating-configurations/api/search-workflow-state/) * [Deprovision workflow]({{site.url}}{{site.baseurl}}/automating-configurations/api/deprovision-workflow/) -* [Delete workflow]({{site.url}}{{site.baseurl}}/automating-configurations/api/delete-workflow/) \ No newline at end of file +* [Delete workflow]({{site.url}}{{site.baseurl}}/automating-configurations/api/delete-workflow/) + +For information about workflow access control, see [Workflow template security]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-security/). \ No newline at end of file diff --git a/_automating-configurations/index.md b/_automating-configurations/index.md index 144ad445c84..68742f61490 100644 --- a/_automating-configurations/index.md +++ b/_automating-configurations/index.md @@ -44,3 +44,4 @@ Workflow automation provides the following benefits: - For the workflow step syntax, see [Workflow steps]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-steps/). - For a complete example, see [Workflow tutorial]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-tutorial/). - For configurable settings, see [Workflow settings]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-settings/). +- For information about workflow access control, see [Workflow template security]({{site.url}}{{site.baseurl}}/automating-configurations/workflow-security/). \ No newline at end of file diff --git a/_automating-configurations/workflow-security.md b/_automating-configurations/workflow-security.md new file mode 100644 index 00000000000..f3a3d7eeb9a --- /dev/null +++ b/_automating-configurations/workflow-security.md @@ -0,0 +1,93 @@ +--- +layout: default +title: Workflow template security +nav_order: 50 +--- + +# Workflow template security + +In OpenSearch, automated workflow configurations are provided by the Flow Framework plugin. You can use the Security plugin together with the Flow Framework plugin to limit non-admin users to specific actions. For example, you might want some users to only be able to create, update, or delete workflows, while others may only be able to view workflows. + +All Flow Framework indexes are protected as system indexes. Only a superadmin user or an admin user with a TLS certificate can access system indexes. For more information, see [System indexes]({{site.url}}{{site.baseurl}}/security/configuration/system-indices/). + +Security for Flow Framework is set up similarly to [security for anomaly detection]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/security/). + +## Basic permissions + +As an admin user, you can use the Security plugin to assign specific permissions to users based on the APIs they need to access. For a list of supported Flow Framework APIs, see [Workflow APIs]({{site.url}}{{site.baseurl}}/automating-configurations/api/index/). + +The Security plugin has two built-in roles that cover most Flow Framework use cases: `flow_framework_full_access` and `flow_framework_read_access`. For descriptions of each, see [Predefined roles]({{site.url}}{{site.baseurl}}/security/access-control/users-roles#predefined-roles). + +If these roles don't meet your needs, you can assign users individual Flow Framework [permissions]({{site.url}}{{site.baseurl}}/security/access-control/permissions/) to suit your use case. Each action corresponds to an operation in the REST API. For example, the `cluster:admin/opensearch/flow_framework/workflow/search` permission lets you search workflows. + +### Fine-grained access control + +To reduce the chances of unintended users viewing metadata that describes an index, we recommend that administrators enable role-based access control when assigning permissions to the intended user group. For more information, see [Limit access by backend role](#advanced-limit-access-by-backend-role). + +## (Advanced) Limit access by backend role + +Use backend roles to configure fine-grained access to individual workflows based on roles. For example, users in different departments of an organization can view workflows owned by their own department. + +First, make sure your users have the appropriate [backend roles]({{site.url}}{{site.baseurl}}/security/access-control/index/). Backend roles usually come from an [LDAP server]({{site.url}}{{site.baseurl}}/security/configuration/ldap/) or [SAML provider]({{site.url}}{{site.baseurl}}/security/configuration/saml/), but if you use an internal user database, you can [create users manually using the API]({{site.url}}{{site.baseurl}}/security/access-control/api#create-user). + +Next, enable the following setting: + +```json +PUT _cluster/settings +{ + "transient": { + "plugins.flow_framework.filter_by_backend_roles": "true" + } +} +``` +{% include copy-curl.html %} + +Now when users view workflow resources in OpenSearch Dashboards (or make REST API calls), they only see workflows created by users who share at least one backend role. + +For example, consider two users: `alice` and `bob`. + +`alice` has an `analyst` backend role: + +```json +PUT _plugins/_security/api/internalusers/alice +{ + "password": "alice", + "backend_roles": [ + "analyst" + ], + "attributes": {} +} +``` + +`bob` has a `human-resources` backend role: + +```json +PUT _plugins/_security/api/internalusers/bob +{ + "password": "bob", + "backend_roles": [ + "human-resources" + ], + "attributes": {} +} +``` + +Both `alice` and `bob` have full access to the Flow Framework APIs: + +```json +PUT _plugins/_security/api/rolesmapping/flow_framework_full_access +{ + "backend_roles": [], + "hosts": [], + "users": [ + "alice", + "bob" + ] +} +``` + +Because they have different backend roles, `alice` and `bob` cannot view each other's workflows or their results. + +Users without backend roles can still view other users' workflow results if they have `flow_framework_read_access`. This also applies to users who have `flow_framework_full_access` because this permission includes all of the permissions of `flow_framework_read_access`. + +Administrators should inform users that the `flow_framework_read_access` permission allows them to view the results of any workflow in a cluster, including data not directly accessible to them. To limit access to the results of a specific workflow, administrators should apply backend role filters when creating the workflow. This ensures that only users with matching backend roles can access that workflow's results. \ No newline at end of file diff --git a/_automating-configurations/workflow-steps.md b/_automating-configurations/workflow-steps.md index 43685a957ac..0f61874be67 100644 --- a/_automating-configurations/workflow-steps.md +++ b/_automating-configurations/workflow-steps.md @@ -75,9 +75,11 @@ You can include the following additional fields in the `user_inputs` field if th You can include the following additional fields in the `previous_node_inputs` field when indicated. -|Field |Data type |Description | -|--- |--- |--- | -|`model_id` |String |The `model_id` is used as an input for several steps. As a special case for the Register Agent step type, if an `llm.model_id` field is not present in the `user_inputs` and not present in `previous_node_inputs`, the `model_id` field from the previous node may be used as a backup for the model ID. | +| Field |Data type | Description | +|-----------------|--- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `model_id` |String | The `model_id` is used as an input for several steps. As a special case for the `register_agent` step type, if an `llm.model_id` field is not present in the `user_inputs` and not present in `previous_node_inputs`, then the `model_id` field from the previous node may be used as a backup for the model ID. The `model_id` will also be included in the `parameters` input of the `create_tool` step for the `MLModelTool`. | +| `agent_id` |String | The `agent_id` is used as an input for several steps. The `agent_id` will also be included in the `parameters` input of the `create_tool` step for the `AgentTool`. | +| `connector_id` |String | The `connector_id` is used as an input for several steps. The `connector_id` will also be included in the `parameters` input of the `create_tool` step for the `ConnectorTool`. | ## Example workflow steps diff --git a/_benchmark/glossary.md b/_benchmark/glossary.md new file mode 100644 index 00000000000..f86591d3d9e --- /dev/null +++ b/_benchmark/glossary.md @@ -0,0 +1,21 @@ +--- +layout: default +title: Glossary +nav_order: 10 +--- + +# OpenSearch Benchmark glossary + +The following terms are commonly used in OpenSearch Benchmark: + +- **Corpora**: A collection of documents. +- **Latency**: If `target-throughput` is disabled (has no value or a value of `0)`, then latency is equal to service time. If `target-throughput` is enabled (has a value of 1 or greater), then latency is equal to the service time plus the amount of time the request waits in the queue before being sent. +- **Metric keys**: The metrics stored by OpenSearch Benchmark, based on the configuration in the [metrics record]({{site.url}}{{site.baseurl}}/benchmark/metrics/metric-records/). +- **Operations**: In workloads, a list of API operations performed by a workload. +- **Pipeline**: A series of steps occurring both before and after running a workload that determines benchmark results. +- **Schedule**: A list of two or more operations performed in the order they appear when a workload is run. +- **Service time**: The amount of time taken for `opensearch-py`, the primary client for OpenSearch Benchmark, to send a request and receive a response from the OpenSearch cluster. It includes the amount of time taken for the server to process a request as well as for network latency, load balancer overhead, and deserialization/serialization. +- **Summary report**: A report generated at the end of a test based on the metric keys defined in the workload. +- **Test**: A single invocation of the OpenSearch Benchmark binary. +- **Throughput**: The number of operations completed in a given period of time. +- **Workload**: A collection of one or more benchmarking tests that use a specific document corpus to perform a benchmark against a cluster. The document corpus contains any indexes, data files, or operations invoked when the workload runs. \ No newline at end of file diff --git a/_benchmark/quickstart.md b/_benchmark/quickstart.md index a6bcd59819a..9ab6d25c77f 100644 --- a/_benchmark/quickstart.md +++ b/_benchmark/quickstart.md @@ -116,7 +116,7 @@ You can now run your first benchmark. The following benchmark uses the [percolat Benchmarks are run using the [`execute-test`]({{site.url}}{{site.baseurl}}/benchmark/commands/execute-test/) command with the following command flags: -For additional `execute_test` command flags, see the [execute-test]({{site.url}}{{site.baseurl}}/benchmark/commands/execute-test/) reference. Some commonly used options are `--workload-params`, `--exclude-tasks`, and `--include-tasks`. +For additional `execute-test` command flags, see the [execute-test]({{site.url}}{{site.baseurl}}/benchmark/commands/execute-test/) reference. Some commonly used options are `--workload-params`, `--exclude-tasks`, and `--include-tasks`. {: .tip} * `--pipeline=benchmark-only` : Informs OSB that users wants to provide their own OpenSearch cluster. @@ -136,7 +136,7 @@ opensearch-benchmark execute-test --pipeline=benchmark-only --workload=percolato ``` {% include copy.html %} -When the `execute_test` command runs, all tasks and operations in the `percolator` workload run sequentially. +When the `execute-test` command runs, all tasks and operations in the `percolator` workload run sequentially. ### Validating the test diff --git a/_benchmark/user-guide/creating-custom-workloads.md b/_benchmark/user-guide/creating-custom-workloads.md index ee0dca1ce97..6effa9a2657 100644 --- a/_benchmark/user-guide/creating-custom-workloads.md +++ b/_benchmark/user-guide/creating-custom-workloads.md @@ -263,7 +263,7 @@ opensearch-benchmark list workloads --workload-path= Use the `opensearch-benchmark execute-test` command to invoke your new workload and run a benchmark test against your OpenSearch cluster, as shown in the following example. Replace `--workload-path` with the path to your custom workload, `--target-host` with the `host:port` pairs for your cluster, and `--client-options` with any authorization options required to access the cluster. ``` -opensearch-benchmark execute_test \ +opensearch-benchmark execute-test \ --pipeline="benchmark-only" \ --workload-path="" \ --target-host="" \ @@ -289,7 +289,7 @@ head -n 1000 -documents.json > -documents-1k.json Then, run `opensearch-benchmark execute-test` with the option `--test-mode`. Test mode runs a quick version of the workload test. ``` -opensearch-benchmark execute_test \ +opensearch-benchmark execute-test \ --pipeline="benchmark-only" \ --workload-path="" \ --target-host="" \ diff --git a/_benchmark/user-guide/distributed-load.md b/_benchmark/user-guide/distributed-load.md index 60fc98500fd..f16de29f88b 100644 --- a/_benchmark/user-guide/distributed-load.md +++ b/_benchmark/user-guide/distributed-load.md @@ -64,7 +64,7 @@ With OpenSearch Benchmark running on all three nodes and the worker nodes set to On **Node 1**, run a benchmark test with the `worker-ips` set to the IP addresses for your worker nodes, as shown in the following example: ``` -opensearch-benchmark execute_test --pipeline=benchmark-only --workload=eventdata --worker-ips=198.52.100.0,198.53.100.0 --target-hosts= --client-options= --kill-running-processes +opensearch-benchmark execute-test --pipeline=benchmark-only --workload=eventdata --worker-ips=198.52.100.0,198.53.100.0 --target-hosts= --client-options= --kill-running-processes ``` After the test completes, the logs generated by the test appear on your worker nodes. diff --git a/_benchmark/user-guide/target-throughput.md b/_benchmark/user-guide/target-throughput.md index 63832de5951..1be0a8be39d 100644 --- a/_benchmark/user-guide/target-throughput.md +++ b/_benchmark/user-guide/target-throughput.md @@ -16,13 +16,15 @@ OpenSearch Benchmark has two testing modes, both of which are related to through ## Benchmarking mode -When you do not specify a `target-throughput`, OpenSearch Benchmark latency tests are performed in *benchmarking mode*. In this mode, the OpenSearch client sends requests to the OpenSearch cluster as fast as possible. After the cluster receives a response from the previous request, OpenSearch Benchmark immediately sends the next request to the OpenSearch client. In this testing mode, latency is identical to service time. +When `target-throughput` is set to `0`, OpenSearch Benchmark latency tests are performed in *benchmarking mode*. In this mode, the OpenSearch client sends requests to the OpenSearch cluster as fast as possible. After the cluster receives a response from the previous request, OpenSearch Benchmark immediately sends the next request to the OpenSearch client. In this testing mode, latency is identical to service time. + +OpenSearch Benchmark issues one request at a time per a single client. The number of clients is set by the `search-clients` setting in the workload parameters. ## Throughput-throttled mode -**Throughput** measures the rate at which OpenSearch Benchmark issues requests, assuming that responses will be returned instantaneously. However, users can set a `target-throughput`, which is a common workload parameter that can be set for each test and is measured in operations per second. +If the `target-throughput` is not set to `0`, then OpenSearch Benchmark issues the next request in accordance with the `target-throughput`, assuming that responses are returned instantaneously. -OpenSearch Benchmark issues one request at a time for a single-client thread, which is specified as `search-clients` in the workload parameters. If `target-throughput` is set to `0`, then OpenSearch Benchmark issues a request immediately after it receives the response from the previous request. If the `target-throughput` is not set to `0`, then OpenSearch Benchmark issues the next request in accordance with the `target-throughput`, assuming that responses are returned instantaneously. +**Throughput** measures the rate at which OpenSearch Benchmark issues requests, assuming that responses are returned instantaneously. To configure the request rate, you can set the `target-throughput` workload parameter to the desired number of operations per second for each test. When you want to simulate the type of traffic you might encounter when deploying a production cluster, set the `target-throughput` in your benchmark test to match the number of requests you estimate that the production cluster might receive. The following examples show how the `target-throughput` setting affects the latency measurement. diff --git a/_benchmark/user-guide/understanding-workloads/choosing-a-workload.md b/_benchmark/user-guide/understanding-workloads/choosing-a-workload.md index d7ae48ad0a3..ae973a7c62f 100644 --- a/_benchmark/user-guide/understanding-workloads/choosing-a-workload.md +++ b/_benchmark/user-guide/understanding-workloads/choosing-a-workload.md @@ -18,12 +18,12 @@ Consider the following criteria when deciding which workload would work best for - The cluster's use case. - The data types that your cluster uses compared to the data structure of the documents contained in the workload. Each workload contains an example document so that you can compare data types, or you can view the index mappings and data types in the `index.json` file. -- The query types most commonly used inside your cluster. The `operations/default.json` file contains information about the query types and workload operations. +- The query types most commonly used inside your cluster. The `operations/default.json` file contains information about the query types and workload operations. For a list of common operations, see [Common operations]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-workloads/common-operations/). ## General search clusters -For benchmarking clusters built for general search use cases, start with the `[nyc_taxis]`(https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis) workload. This workload contains data about the rides taken in yellow taxis in New York City in 2015. +For benchmarking clusters built for general search use cases, start with the [nyc_taxis](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/nyc_taxis) workload. This workload contains data about the rides taken in yellow taxis in New York City in 2015. ## Log data -For benchmarking clusters built for indexing and search with log data, use the [`http_logs`](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/http_logs) workload. This workload contains data about the 1998 World Cup. \ No newline at end of file +For benchmarking clusters built for indexing and search with log data, use the [http_logs](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/http_logs) workload. This workload contains data about the 1998 World Cup. diff --git a/_benchmark/user-guide/understanding-workloads/common-operations.md b/_benchmark/user-guide/understanding-workloads/common-operations.md new file mode 100644 index 00000000000..c9fe15c18c0 --- /dev/null +++ b/_benchmark/user-guide/understanding-workloads/common-operations.md @@ -0,0 +1,181 @@ +--- +layout: default +title: Common operations +nav_order: 16 +grand_parent: User guide +parent: Understanding workloads +--- + +# Common operations + +[Test procedures]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-workloads/anatomy-of-a-workload#_operations-and-_test-procedures) use a variety of operations, found inside the `operations` directory of a workload. This page details the most common operations found inside OpenSearch Benchmark workloads. + +- [Common operations](#common-operations) + - [bulk](#bulk) + - [create-index](#create-index) + - [delete-index](#delete-index) + - [cluster-health](#cluster-health) + - [refresh](#refresh) + - [search](#search) + + +## bulk + + +The `bulk` operation type allows you to run [bulk](/api-reference/document-apis/bulk/) requests as a task. + +The following example shows a `bulk` operation type with a `bulk-size` of `5000` documents: + +```yml +{ + "name": "index-append", + "operation-type": "bulk", + "bulk-size": 5000 +} +``` + + + +## create-index + + +The `create-index` operation runs the [Create Index API](/api-reference/index-apis/create-index/). It supports the following two modes of index creation: + +- Creating all indexes specified in the workloads `indices` section +- Creating one specific index defined within the operation itself + +The following example creates all indexes defined in the `indices` section of the workload. It uses all of the index settings defined in the workload but overrides the number of shards: + +```yml +{ + "name": "create-all-indices", + "operation-type": "create-index", + "settings": { + "index.number_of_shards": 1 + }, + "request-params": { + "wait_for_active_shards": "true" + } +} +``` + +The following example creates a new index with all index settings specified in the operation body: + +```yml +{ + "name": "create-an-index", + "operation-type": "create-index", + "index": "people", + "body": { + "settings": { + "index.number_of_shards": 0 + }, + "mappings": { + "docs": { + "properties": { + "name": { + "type": "text" + } + } + } + } + } +} +``` + + + + +## delete-index + + +The `delete-index` operation runs the [Delete Index API](api-reference/index-apis/delete-index/). Like with the [`create-index`](#create-index) operation, you can delete all indexes found in the `indices` section of the workload or delete one or more indexes based on the string passed in the `index` setting. + +The following example deletes all indexes found in the `indices` section of the workload: + +```yml +{ + "name": "delete-all-indices", + "operation-type": "delete-index" +} +``` + +The following example deletes all `logs_*` indexes: + +```yml +{ + "name": "delete-logs", + "operation-type": "delete-index", + "index": "logs-*", + "only-if-exists": false, + "request-params": { + "expand_wildcards": "all", + "allow_no_indices": "true", + "ignore_unavailable": "true" + } +} +``` + + +## cluster-health + + +The `cluster-health` operation runs the [Cluster Health API](api-reference/cluster-api/cluster-health/), which checks the cluster health status and returns the expected status according to the parameters set for `request-params`. If an unexpected cluster health status is returned, the operation reports a failure. You can use the `--on-error` option in the OpenSearch Benchmark `execute-test` command to control how OpenSearch Benchmark behaves when the health check fails. + +The following example creates a `cluster-health` operation that checks for a `green` health status on any `log-*` indexes: + +```yml +{ + "name": "check-cluster-green", + "operation-type": "cluster-health", + "index": "logs-*", + "request-params": { + "wait_for_status": "green", + "wait_for_no_relocating_shards": "true" + }, + "retry-until-success": true +} + +``` + + +## refresh + + +The `refresh` operation runs the Refresh API. The `operation` returns no metadata. + + +The following example refreshes all `logs-*` indexes: + +```yml +{ + "name": "refresh", + "operation-type": "refresh", + "index": "logs-*" +} +``` + + + +## search + + +The `search` operation runs the [Search API](/api-reference/search/), which you can use to run queries in OpenSearch Benchmark indexes. + +The following example runs a `match_all` query inside the `search` operation: + +```yml +{ + "name": "default", + "operation-type": "search", + "body": { + "query": { + "match_all": {} + } + }, + "request-params": { + "_source_include": "some_field", + "analyze_wildcard": "false" + } +} +``` diff --git a/_clients/index.md b/_clients/index.md index fc8c23d912c..a15f0539d27 100644 --- a/_clients/index.md +++ b/_clients/index.md @@ -53,8 +53,8 @@ To view the compatibility matrix for a specific client, see the `COMPATIBILITY.m Client | Recommended version :--- | :--- -[Elasticsearch Java low-level REST client](https://search.maven.org/artifact/org.elasticsearch.client/elasticsearch-rest-client/7.13.4/jar) | 7.13.4 -[Elasticsearch Java high-level REST client](https://search.maven.org/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client/7.13.4/jar) | 7.13.4 +[Elasticsearch Java low-level REST client](https://central.sonatype.com/artifact/org.elasticsearch.client/elasticsearch-rest-client/7.13.4) | 7.13.4 +[Elasticsearch Java high-level REST client](https://central.sonatype.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client/7.13.4) | 7.13.4 [Elasticsearch Python client](https://pypi.org/project/elasticsearch/7.13.4/) | 7.13.4 [Elasticsearch Node.js client](https://www.npmjs.com/package/@elastic/elasticsearch/v/7.13.0) | 7.13.0 [Elasticsearch Ruby client](https://rubygems.org/gems/elasticsearch/versions/7.13.0) | 7.13.0 diff --git a/_config.yml b/_config.yml index 8a43e2f61ae..4ead6344c24 100644 --- a/_config.yml +++ b/_config.yml @@ -5,9 +5,9 @@ baseurl: "/docs/latest" # the subpath of your site, e.g. /blog url: "https://opensearch.org" # the base hostname & protocol for your site, e.g. http://example.com permalink: /:path/ -opensearch_version: '2.16.0' -opensearch_dashboards_version: '2.16.0' -opensearch_major_minor_version: '2.16' +opensearch_version: '2.17.0' +opensearch_dashboards_version: '2.17.0' +opensearch_major_minor_version: '2.17' lucene_version: '9_11_1' # Build settings diff --git a/_dashboards/management/S3-data-source.md b/_dashboards/management/S3-data-source.md index 585edeac81a..1a7cc579b08 100644 --- a/_dashboards/management/S3-data-source.md +++ b/_dashboards/management/S3-data-source.md @@ -10,51 +10,41 @@ has_children: true Introduced 2.11 {: .label .label-purple } -Starting with OpenSearch 2.11, you can connect OpenSearch to your Amazon Simple Storage Service (Amazon S3) data source using the OpenSearch Dashboards UI. You can then query that data, optimize query performance, define tables, and integrate your S3 data within a single UI. +You can connect OpenSearch to your Amazon Simple Storage Service (Amazon S3) data source using the OpenSearch Dashboards interface and then query that data, optimize query performance, define tables, and integrate your S3 data. ## Prerequisites -To connect data from Amazon S3 to OpenSearch using OpenSearch Dashboards, you must have: +Before connecting a data source, verify that the following requirements are met: -- Access to Amazon S3 and the [AWS Glue Data Catalog](https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/admin/connectors/s3glue_connector.rst#id2). -- Access to OpenSearch and OpenSearch Dashboards. -- An understanding of OpenSearch data source and connector concepts. See the [developer documentation](https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/admin/datasources.rst#introduction) for information about these concepts. +- You have access to Amazon S3 and the [AWS Glue Data Catalog](https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/admin/connectors/s3glue_connector.rst#id2). +- You have access to OpenSearch and OpenSearch Dashboards. +- You have an understanding of OpenSearch data source and connector concepts. See the [developer documentation](https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/admin/datasources.rst#introduction) for more information. -## Connect your Amazon S3 data source +## Connect your data source -To connect your Amazon S3 data source, follow these steps: +To connect your data source, follow these steps: -1. From the OpenSearch Dashboards main menu, select **Management** > **Data sources**. -2. On the **Data sources** page, select **New data source** > **S3**. An example UI is shown in the following image. +1. From the OpenSearch Dashboards main menu, go to **Management** > **Dashboards Management** > **Data sources**. +2. On the **Data sources** page, select **Create data source connection** > **Amazon S3**. +3. On the **Configure Amazon S3 data source** page, enter the data source, authentication details, and permissions. +4. Select the **Review Configuration** button to verify the connection details. +5. Select the **Connect to Amazon S3** button to establish a connection. - Amazon S3 data sources UI +## Manage your data source -3. On the **Configure Amazon S3 data source** page, enter the required **Data source details**, **AWS Glue authentication details**, **AWS Glue index store details**, and **Query permissions**. An example UI is shown in the following image. - - Amazon S3 configuration UI - -4. Select the **Review Configuration** button and verify the details. -5. Select the **Connect to Amazon S3** button. - -## Manage your Amazon S3 data source - -Once you've connected your Amazon S3 data source, you can explore your data through the **Manage data sources** tab. The following steps guide you through using this functionality: +To manage your data source, follow these steps: 1. On the **Manage data sources** tab, choose a date source from the list. -2. On that data source's page, you can manage the data source, choose a use case, and manage access controls and configurations. An example UI is shown in the following image. - - Manage data sources UI - -3. (Optional) Explore the Amazon S3 use cases, including querying your data and optimizing query performance. Go to **Next steps** to learn more about each use case. +2. On the page for the data source, you can manage the data source, choose a use case, and configure access controls. +3. (Optional) Explore the Amazon S3 use cases, including querying your data and optimizing query performance. Refer to the [**Next steps**](#next-steps) section to learn more about each use case. ## Limitations -This feature is still under development, including the data integration functionality. For real-time updates, see the [developer documentation on GitHub](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#limitations). +This feature is currently under development, including the data integration functionality. For up-to-date information, refer to the [developer documentation on GitHub](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#limitations). ## Next steps - Learn about [querying your data in Data Explorer]({{site.url}}{{site.baseurl}}/dashboards/management/query-data-source/) through OpenSearch Dashboards. -- Learn about ways to [optimize the query performance of your external data sources]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/), such as Amazon S3, through Query Workbench. +- Learn about [optimizing the query performance of your external data sources]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/), such as Amazon S3, through Query Workbench. - Learn about [Amazon S3 and AWS Glue Data Catalog](https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/admin/connectors/s3glue_connector.rst) and the APIS used with Amazon S3 data sources, including configuration settings and query examples. - Learn about [managing your indexes]({{site.url}}{{site.baseurl}}/dashboards/im-dashboards/index/) through OpenSearch Dashboards. - diff --git a/_dashboards/management/accelerate-external-data.md b/_dashboards/management/accelerate-external-data.md index 00e4600ffde..6d1fa030e40 100644 --- a/_dashboards/management/accelerate-external-data.md +++ b/_dashboards/management/accelerate-external-data.md @@ -12,55 +12,37 @@ Introduced 2.11 {: .label .label-purple } -Query performance can be slow when using external data sources for reasons such as network latency, data transformation, and data volume. You can optimize your query performance by using OpenSearch indexes, such as a skipping index or a covering index. A _skipping index_ uses skip acceleration methods, such as partition, minimum and maximum values, and value sets, to ingest and create compact aggregate data structures. This makes them an economical option for direct querying scenarios. A _covering index_ ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality. See the [Flint Index Reference Manual](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md) for comprehensive guidance on this feature's indexing process. +Query performance can be slow when using external data sources for reasons such as network latency, data transformation, and data volume. You can optimize your query performance by using OpenSearch indexes, such as a skipping index or a covering index. + +A _skipping index_ uses skip acceleration methods, such as partition, minimum and maximum values, and value sets, to ingest and create compact aggregate data structures. This makes them an economical option for direct querying scenarios. + +A _covering index_ ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality. See the [Flint Index Reference Manual](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md) for comprehensive guidance on this feature's indexing process. ## Data sources use case: Accelerate performance To get started with the **Accelerate performance** use case available in **Data sources**, follow these steps: 1. Go to **OpenSearch Dashboards** > **Query Workbench** and select your Amazon S3 data source from the **Data sources** dropdown menu in the upper-left corner. -2. From the left-side navigation menu, select a database. An example using the `http_logs` database is shown in the following image. - - Query Workbench accelerate data UI - +2. From the left-side navigation menu, select a database. 3. View the results in the table and confirm that you have the desired data. 4. Create an OpenSearch index by following these steps: - 1. Select the **Accelerate data** button. A pop-up window appears. An example is shown in the following image. - - Accelerate data pop-up window - + 1. Select the **Accelerate data** button. A pop-up window appears. 2. Enter your details in **Select data fields**. In the **Database** field, select the desired acceleration index: **Skipping index** or **Covering index**. A _skipping index_ uses skip acceleration methods, such as partition, min/max, and value sets, to ingest data using compact aggregate data structures. This makes them an economical option for direct querying scenarios. A _covering index_ ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality. - -5. Under **Index settings**, enter the information for your acceleration index. For information about naming, select **Help**. Note that an Amazon S3 table can only have one skipping index at a time. An example is shown in the following image. - - Skipping index settings +5. Under **Index settings**, enter the information for your acceleration index. For information about naming, select **Help**. Note that an Amazon S3 table can only have one skipping index at a time. ### Define skipping index settings -1. Under **Skipping index definition**, select the **Add fields** button to define the skipping index acceleration method and choose the fields you want to add. An example is shown in the following image. - - Skipping index add fields - +1. Under **Skipping index definition**, select the **Add fields** button to define the skipping index acceleration method and choose the fields you want to add. 2. Select the **Copy Query to Editor** button to apply your skipping index settings. -3. View the skipping index query details in the table pane and then select the **Run** button. Your index is added to the left-side navigation menu containing the list of your databases. An example is shown in the following image. - - Run a skippping or covering index UI +3. View the skipping index query details in the table pane and then select the **Run** button. Your index is added to the left-side navigation menu containing the list of your databases. ### Define covering index settings -1. Under **Index settings**, enter a valid index name. Note that each Amazon S3 table can have multiple covering indexes. An example is shown in the following image. - - Covering index settings - -2. Once you have added the index name, define the covering index fields by selecting `(add fields here)` under **Covering index definition**. An example is shown in the following image. - - Covering index field naming - +1. Under **Index settings**, enter a valid index name. Note that each Amazon S3 table can have multiple covering indexes. +2. Once you have added the index name, define the covering index fields by selecting `(add fields here)` under **Covering index definition**. 3. Select the **Copy Query to Editor** button to apply your covering index settings. -4. View the covering index query details in the table pane and then select the **Run** button. Your index is added to the left-side navigation menu containing the list of your databases. An example UI is shown in the following image. - - Run index in Query Workbench +4. View the covering index query details in the table pane and then select the **Run** button. Your index is added to the left-side navigation menu containing the list of your databases. ## Limitations -This feature is still under development, so there are some limitations. For real-time updates, see the [developer documentation on GitHub](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#limitations). +This feature is still under development, so there are some limitations. For real-time updates, refer to the [developer documentation on GitHub](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#limitations). diff --git a/_dashboards/management/connect-prometheus.md b/_dashboards/management/connect-prometheus.md new file mode 100644 index 00000000000..ed5545bc563 --- /dev/null +++ b/_dashboards/management/connect-prometheus.md @@ -0,0 +1,53 @@ +--- +layout: default +title: Connecting Prometheus to OpenSearch +parent: Data sources +nav_order: 20 +--- + +# Connecting Prometheus to OpenSearch +Introduced 2.16 +{: .label .label-purple } + +This documentation covers the key steps to connect Prometheus to OpenSearch using the OpenSearch Dashboards interface, including setting up the data source connection, modifying the connection details, and creating an index pattern for the Prometheus data. + +## Prerequisites and permissions + +Before connecting a data source, ensure you have met the [Prerequisites]({{site.url}}{{site.baseurl}}/dashboards/management/data-sources/#prerequisites) and have the necessary [Permissions]({{site.url}}{{site.baseurl}}/dashboards/management/data-sources/#permissions). + +## Create a Prometheus data source connection + +A data source connection specifies the parameters needed to connect to a data source. These parameters form a connection string for the data source. Using OpenSearch Dashboards, you can add new **Prometheus** data source connections or manage existing ones. + +Follow these steps to connect your data source: + +1. From the OpenSearch Dashboards main menu, go to **Management** > **Data sources** > **New data source** > **Prometheus**. + +2. From the **Configure Prometheus data source** section: + + - Under **Data source details**, provide a title and optional description. + - Under **Prometheus data location**, enter the Prometheus URI. + - Under **Authentication details**, select the appropriate authentication method from the dropdown list and enter the required details: + - **Basic authentication**: Enter a username and password. + - **AWS Signature Version 4**: Specify the **Region**, select the OpenSearch service from the **Service Name** list (**Amazon OpenSearch Service** or **Amazon OpenSearch Serverless**), and enter the **Access Key** and **Secret Key**. + - Under **Query permissions**, choose the role needed to search and index data. If you select **Restricted**, an additional field will become available to configure the required role. + +3. Select **Review Configuration** > **Connect to Prometheus** to save your settings. The new connection will appear in the list of data sources. + +## Modify a data source connection + +To modify a data source connection, follow these steps: + +1. Select the desired connection from the list on the **Data sources** main page. This will open the **Connection Details** window. +2. Within the **Connection Details** window, edit the **Title** and **Description** fields. Select the **Save changes** button to apply the changes. +3. To update the **Authentication Method**, choose the method from the dropdown list and enter any necessary credentials. Select **Save changes** to apply the changes. + - To update the **Basic authentication** authentication method, select the **Update stored password** button. Within the pop-up window, enter the updated password and confirm it and select **Update stored password** to save the changes. To test the connection, select the **Test connection** button. + - To update the **AWS Signature Version 4** authentication method, select the **Update stored AWS credential** button. Within the pop-up window, enter the updated access and secret keys and select **Update stored AWS credential** to save the changes. To test the connection, select the **Test connection** button. + +## Delete a data source connection + +To delete the data source connection, select the {::nomarkdown}delete icon{:/} icon. + +## Create an index pattern + +After creating a data source connection, the next step is to create an index pattern for that data source. For more information and a tutorial on index patterns, refer to [Index patterns]({{site.url}}{{site.baseurl}}/dashboards/management/index-patterns/). diff --git a/_dashboards/management/data-sources.md b/_dashboards/management/data-sources.md index fdd4edc150f..62d3a5aab2b 100644 --- a/_dashboards/management/data-sources.md +++ b/_dashboards/management/data-sources.md @@ -13,67 +13,37 @@ This documentation focuses on using the OpenSeach Dashboards interface to connec ## Prerequisites -The first step in connecting your data sources to OpenSearch is to install OpenSearch and OpenSearch Dashboards on your system. You can follow the installation instructions in the [OpenSearch documentation]({{site.url}}{{site.baseurl}}/install-and-configure/index/) to install these tools. +The first step in connecting your data sources to OpenSearch is to install OpenSearch and OpenSearch Dashboards on your system. Refer to the [installation instructions]({{site.url}}{{site.baseurl}}/install-and-configure/index/) for information. Once you have installed OpenSearch and OpenSearch Dashboards, you can use Dashboards to connect your data sources to OpenSearch and then use Dashboards to manage data sources, create index patterns based on those data sources, run queries against a specific data source, and combine visualizations in one dashboard. Configuration of the [YAML files]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/#configuration-file) and installation of the `dashboards-observability` and `opensearch-sql` plugins is necessary. For more information, see [OpenSearch plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/). -## Permissions - -To work with data sources in OpenSearch Dashboards, make sure that the user has been assigned the correct cluster-level [data source permission]({{site.url}}{{site.baseurl}}/security/access-control/permissions#data-source-permissions). - - - -## Create a data source connection - -A data source connection specifies the parameters needed to connect to a data source. These parameters form a connection string for the data source. Using Dashboards, you can add new data source connections or manage existing ones. - -The following steps guide you through the basics of creating a data source connection: - -1. From the OpenSearch Dashboards main menu, select **Management** > **Data sources** > **Create data source connection**. The UI is shown in the following image. +To securely store and encrypt data source connections in OpenSearch, you must add the following configuration to the `opensearch.yml` file on all the nodes: - Connecting a data source UI +`plugins.query.datasources.encryption.masterkey: "YOUR_GENERATED_MASTER_KEY_HERE"` -2. Create the data source connection by entering the appropriate information into the **Connection Details** and **Authentication Method** fields. - - - Under **Connection Details**, enter a title and endpoint URL. For this tutorial, use the URL `http://localhost:5601/app/management/opensearch-dashboards/dataSources`. Entering a description is optional. +The key must be 16, 24, or 32 characters. You can use the following command to generate a 24-character key: - - Under **Authentication Method**, select an authentication method from the dropdown list. Once an authentication method is selected, the applicable fields for that method appear. You can then enter the required details. The authentication method options are: - - **No authentication**: No authentication is used to connect to the data source. - - **Username & Password**: A basic username and password are used to connect to the data source. - - **AWS SigV4**: An AWS Signature Version 4 authenticating request is used to connect to the data source. AWS Signature Version 4 requires an access key and a secret key. - - For AWS Signature Version 4 authentication, first specify the **Region**. Next, select the OpenSearch service in the **Service Name** list. The options are **Amazon OpenSearch Service** and **Amazon OpenSearch Serverless**. Lastly, enter the **Access Key** and **Secret Key** for authorization. +`openssl rand -hex 12` - After you have populated the required fields, the **Test connection** and **Create data source** buttons become active. You can select **Test connection** to confirm that the connection is valid. +Generating 12 bytes results in a hexadecimal string that is 12 * 2 = 24 characters. +{: .note} -3. Select **Create data source** to save your settings. The connection is created. The active window returns to the **Data sources** main page, and the new connection appears in the list of data sources. - -4. To delete a data source connection, select the checkbox to the left of the data source **Title** and then select the **Delete 1 connection** button. Selecting multiple checkboxes for multiple connections is supported. An example UI is shown in the following image. - - Deleting a data source UI - -### Modify a data source connection - -To make changes to a data source connection, select a connection in the list on the **Data sources** main page. The **Connection Details** window opens. - -To make changes to **Connection Details**, edit one or both of the **Title** and **Description** fields and select **Save changes** in the lower-right corner of the screen. You can also cancel changes here. To change the **Authentication Method**, choose a different authentication method, enter your credentials (if applicable), and then select **Save changes** in the lower-right corner of the screen. The changes are saved. - -When **Username & Password** is the selected authentication method, you can update the password by choosing **Update stored password** next to the **Password** field. In the pop-up window, enter a new password in the first field and then enter it again in the second field to confirm. Select **Update stored password** in the pop-up window. The new password is saved. Select **Test connection** to confirm that the connection is valid. +## Permissions -When **AWS SigV4** is the selected authentication method, you can update the credentials by selecting **Update stored AWS credential**. In the pop-up window, enter a new access key in the first field and a new secret key in the second field. Select **Update stored AWS credential** in the pop-up window. The new credentials are saved. Select **Test connection** in the upper-right corner of the screen to confirm that the connection is valid. +To work with data sources in OpenSearch Dashboards, you must be assigned the correct cluster-level [data source permissions]({{site.url}}{{site.baseurl}}/security/access-control/permissions#data-source-permissions). -To delete the data source connection, select the delete icon ({::nomarkdown}delete icon{:/}). +## Types of data streams -## Create an index pattern +To configure data sources through OpenSearch Dashboards, go to **Management** > **Dashboards Management** > **Data sources**. This flow can be used for OpenSearch data stream connections. See [Configuring and using multiple data sources]({{site.url}}{{site.baseurl}}/dashboards/management/multi-data-sources/). -Once you've created a data source connection, you can create an index pattern for the data source. An _index pattern_ is a template that OpenSearch uses to create indexes for data from the data source. See [Index patterns]({{site.url}}{{site.baseurl}}/dashboards/management/index-patterns/) for more information and a tutorial. +Alternatively, if you are running OpenSearch Dashboards 2.16 or later, go to **Management** > **Data sources**. This flow can be used to connect Amazon Simple Storage Service (Amazon S3) and Prometheus. See [Connecting Amazon S3 to OpenSearch]({{site.url}}{{site.baseurl}}/dashboards/management/S3-data-source/) and [Connecting Prometheus to OpenSearch]({{site.url}}{{site.baseurl}}/dashboards/management/connect-prometheus/) for more information. ## Next steps - Learn about [managing index patterns]({{site.url}}{{site.baseurl}}/dashboards/management/index-patterns/) through OpenSearch Dashboards. - Learn about [indexing data using Index Management]({{site.url}}{{site.baseurl}}/dashboards/im-dashboards/index/) through OpenSearch Dashboards. - Learn about how to connect [multiple data sources]({{site.url}}{{site.baseurl}}/dashboards/management/multi-data-sources/). -- Learn about how to [connect OpenSearch and Amazon S3 through OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/dashboards/management/S3-data-source/). -- Learn about the [Integrations]({{site.url}}{{site.baseurl}}/integrations/index/) tool, which gives you the flexibility to use various data ingestion methods and connect data from the Dashboards UI. - +- Learn about how to connect [OpenSearch and Amazon S3]({{site.url}}{{site.baseurl}}/dashboards/management/S3-data-source/) and [OpenSearch and Prometheus]({{site.url}}{{site.baseurl}}/dashboards/management/connect-prometheus/) using the OpenSearch Dashboards interface. +- Learn about the [Integrations]({{site.url}}{{site.baseurl}}/integrations/index/) plugin, which gives you the flexibility to use various data ingestion methods and connect data to OpenSearch Dashboards. diff --git a/_dashboards/management/query-data-source.md b/_dashboards/management/query-data-source.md index f1496b3e172..a3392c073e4 100644 --- a/_dashboards/management/query-data-source.md +++ b/_dashboards/management/query-data-source.md @@ -11,7 +11,7 @@ has_children: false Introduced 2.11 {: .label .label-purple } -This tutorial guides you through using the **Query data** use case for querying and visualizing your Amazon Simple Storage Service (Amazon S3) data using OpenSearch Dashboards. +This tutorial guides you through using the **Query data** use case for querying and visualizing your Amazon Simple Storage Service (Amazon S3) data using OpenSearch Dashboards. ## Prerequisites @@ -22,15 +22,9 @@ You must be using the `opensearch-security` plugin and have the appropriate role To get started, follow these steps: 1. On the **Manage data sources** page, select your data source from the list. -2. On the data source's detail page, select the **Query data** card. This option takes you to the **Observability** > **Logs** page, which is shown in the following image. - - Observability Logs UI - +2. On the data source's detail page, select the **Query data** card. This option takes you to the **Observability** > **Logs** page. 3. Select the **Event Explorer** button. This option creates and saves frequently searched queries and visualizations using [Piped Processing Language (PPL)]({{site.url}}{{site.baseurl}}/search-plugins/sql/ppl/index/) or [SQL]({{site.url}}{{site.baseurl}}/search-plugins/sql/index/), which connects to Spark SQL. -4. Select the Amazon S3 data source from the dropdown menu in the upper-left corner. An example is shown in the following image. - - Observability Logs Amazon S3 dropdown menu - +4. Select the Amazon S3 data source from the dropdown menu in the upper-left corner. 5. Enter the query in the **Enter PPL query** field. Note that the default language is SQL. To change the language, select PPL from the dropdown menu. 6. Select the **Search** button. The **Query Processing** message is shown, confirming that your query is being processed. 7. View the results, which are listed in a table on the **Events** tab. On this page, details such as available fields, source, and time are shown in a table format. @@ -40,10 +34,7 @@ To get started, follow these steps: To create visualizations, follow these steps: -1. On the **Explorer** page, select the **Visualizations** tab. An example is shown in the following image. - - Explorer Amazon S3 visualizations UI - +1. On the **Explorer** page, select the **Visualizations** tab. 2. Select **Index data to visualize**. This option currently only creates [acceleration indexes]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/), which give you views of the data visualizations from the **Visualizations** tab. To create a visualization of your Amazon S3 data, go to **Discover**. See the [Discover documentation]({{site.url}}{{site.baseurl}}/dashboards/discover/index-discover/) for information and a tutorial. ## Use Query Workbench with your Amazon S3 data source @@ -53,14 +44,12 @@ To create visualizations, follow these steps: To use Query Workbench with your Amazon S3 data, follow these steps: 1. From the OpenSearch Dashboards main menu, select **OpenSearch Plugins** > **Query Workbench**. -2. From the **Data Sources** dropdown menu in the upper-left corner, choose your Amazon S3 data source. Your data begins loading the databases that are part of your data source. An example is shown in the following image. - - Query Workbench Amazon S3 data loading UI - +2. From the **Data Sources** dropdown menu in the upper-left corner, choose your Amazon S3 data source. Your data begins loading the databases that are part of your data source. 3. View the databases listed in the left-side navigation menu and select a database to view its details. Any information about acceleration indexes is listed under **Acceleration index destination**. 4. Choose the **Describe Index** button to learn more about how data is stored in that particular index. 5. Choose the **Drop index** button to delete and clear both the OpenSearch index and the Amazon S3 Spark job that refreshes the data. -6. Enter your SQL query and select **Run**. +6. Enter your SQL query and select **Run**. + ## Next steps - Learn about [accelerating the query performance of your external data sources]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/). diff --git a/_dashboards/query-workbench.md b/_dashboards/query-workbench.md index 8fe41afcdf7..700d6a73404 100644 --- a/_dashboards/query-workbench.md +++ b/_dashboards/query-workbench.md @@ -8,19 +8,14 @@ redirect_from: # Query Workbench -Query Workbench is a tool within OpenSearch Dashboards. You can use Query Workbench to run on-demand [SQL]({{site.url}}{{site.baseurl}}/search-plugins/sql/sql/index/) and [PPL]({{site.url}}{{site.baseurl}}/search-plugins/sql/ppl/index/) queries, translate queries into their equivalent REST API calls, and view and save results in different [response formats]({{site.url}}{{site.baseurl}}/search-plugins/sql/response-formats/). +You can use Query Workbench in OpenSearch Dashboards to run on-demand [SQL]({{site.url}}{{site.baseurl}}/search-plugins/sql/sql/index/) and [PPL]({{site.url}}{{site.baseurl}}/search-plugins/sql/ppl/index/) queries, translate queries into their equivalent REST API calls, and view and save results in different [response formats]({{site.url}}{{site.baseurl}}/search-plugins/sql/response-formats/). -A view of the Query Workbench interface within OpenSearch Dashboards is shown in the following image. - -Query Workbench interface within OpenSearch Dashboards - -## Prerequisites - -Before getting started, make sure you have [indexed your data]({{site.url}}{{site.baseurl}}/im-plugin/index/). +Query Workbench does not support delete or update operations through SQL or PPL. Access to data is read-only. +{: .important} -For this tutorial, you can index the following sample documents. Alternatively, you can use the [OpenSearch Playground](https://playground.opensearch.org/app/opensearch-query-workbench#/), which has preloaded indexes that you can use to try out Query Workbench. +## Prerequisites -To index sample documents, send the following [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/) request: +Before getting started with this tutorial, index the sample documents by sending the following [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/) request: ```json PUT accounts/_bulk?refresh @@ -35,9 +30,11 @@ PUT accounts/_bulk?refresh ``` {% include copy-curl.html %} -## Running SQL queries within Query Workbench +See [Managing indexes]({{site.url}}{{site.baseurl}}/im-plugin/index/) to learn about indexing your own data. -Follow these steps to learn how to run SQL queries against your OpenSearch data using Query Workbench: +## Running SQL queries within Query Workbench + + The following steps guide you through running SQL queries against OpenSearch data: 1. Access Query Workbench. - To access Query Workbench, go to OpenSearch Dashboards and choose **OpenSearch Plugins** > **Query Workbench** from the main menu. @@ -64,23 +61,15 @@ Follow these steps to learn how to run SQL queries against your OpenSearch data 3. View the results. - View the results in the **Results** pane, which presents the query output in tabular format. You can filter and download the results as needed. - The following image shows the query editor pane and results pane for the preceding SQL query: - - Query Workbench SQL query input and results output panes - 4. Clear the query editor. - Select the **Clear** button to clear the query editor and run a new query. 5. Examine how the query is processed. - - Select the **Explain** button to examine how OpenSearch processes the query, including the steps involved and order of operations. - - The following image shows the explanation of the SQL query that was run in step 2. - - Query Workbench SQL query explanation pane + - Select the **Explain** button to examine how OpenSearch processes the query, including the steps involved and order of operations. ## Running PPL queries within Query Workbench -Follow these steps to learn how to run PPL queries against your OpenSearch data using Query Workbench: +Follow these steps to learn how to run PPL queries against OpenSearch data: 1. Access Query Workbench. - To access Query Workbench, go to OpenSearch Dashboards and choose **OpenSearch Plugins** > **Query Workbench** from the main menu. @@ -100,19 +89,8 @@ Follow these steps to learn how to run PPL queries against your OpenSearch data 3. View the results. - View the results in the **Results** pane, which presents the query output in tabular format. - The following image shows the query editor pane and results pane for the PPL query that was run in step 2: - - Query Workbench PPL query input and results output panes - 4. Clear the query editor. - Select the **Clear** button to clear the query editor and run a new query. 5. Examine how the query is processed. - - Select the **Explain** button to examine how OpenSearch processes the query, including the steps involved and order of operations. - - The following image shows the explanation of the PPL query that was run in step 2. - - Query Workbench PPL query explanation pane - -Query Workbench does not support delete or update operations through SQL or PPL. Access to data is read-only. -{: .important} \ No newline at end of file + - Select the **Explain** button to examine how OpenSearch processes the query, including the steps involved and order of operations. diff --git a/_data-prepper/common-use-cases/log-analytics.md b/_data-prepper/common-use-cases/log-analytics.md index 30a021b1010..ceb26ff5b78 100644 --- a/_data-prepper/common-use-cases/log-analytics.md +++ b/_data-prepper/common-use-cases/log-analytics.md @@ -147,6 +147,6 @@ The following is an example `fluent-bit.conf` file with SSL and basic authentica See the [Data Prepper Log Ingestion Demo Guide](https://github.com/opensearch-project/data-prepper/blob/main/examples/log-ingestion/README.md) for a specific example of Apache log ingestion from `FluentBit -> Data Prepper -> OpenSearch` running through Docker. -In the future, Data Prepper will offer additional sources and processors that will make more complex log analytics pipelines available. Check out the [Data Prepper Project Roadmap](https://github.com/opensearch-project/data-prepper/projects/1) to see what is coming. +In the future, Data Prepper will offer additional sources and processors that will make more complex log analytics pipelines available. Check out the [Data Prepper Project Roadmap](https://github.com/orgs/opensearch-project/projects/221) to see what is coming. If there is a specific source, processor, or sink that you would like to include in your log analytics workflow and is not currently on the roadmap, please bring it to our attention by creating a GitHub issue. Additionally, if you are interested in contributing to Data Prepper, see our [Contributing Guidelines](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md) as well as our [developer guide](https://github.com/opensearch-project/data-prepper/blob/main/docs/developer_guide.md) and [plugin development guide](https://github.com/opensearch-project/data-prepper/blob/main/docs/plugin_development.md). diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md index 3ff266cccf1..6bae749d387 100644 --- a/_data-prepper/pipelines/configuration/sinks/s3.md +++ b/_data-prepper/pipelines/configuration/sinks/s3.md @@ -173,14 +173,14 @@ When you provide your own Avro schema, that schema defines the final structure o In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and all keys must be present in all events in order to automatically generate a working schema. Automatically generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control which data is included in the automatically generated schema. -Avro fields should use a null [union](https://avro.apache.org/docs/1.10.2/spec.html#Unions) because this will allow missing values. Otherwise, all required fields must be present for each event. Use non-nullable fields only when you are certain they exist. +Avro fields should use a null [union](https://avro.apache.org/docs/1.12.0/specification/#unions) because this will allow missing values. Otherwise, all required fields must be present for each event. Use non-nullable fields only when you are certain they exist. Use the following options to configure the codec. Option | Required | Type | Description :--- | :--- | :--- | :--- -`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/1.2.0/spec.html#schemas). Not required if `auto_schema` is set to true. -`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/1.2.0/spec.html#schemas) from the first event. +`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/1.12.0/specification/#schema-declaration). Not required if `auto_schema` is set to true. +`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/1.12.0/specification/#schema-declaration) from the first event. ### `ndjson` codec @@ -208,8 +208,8 @@ Use the following options to configure the codec. Option | Required | Type | Description :--- | :--- | :--- | :--- -`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to true. -`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event. +`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/1.12.0/specification/#schema-declaration). Not required if `auto_schema` is set to true. +`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/1.12.0/specification/#schema-declaration) from the first event. ### Setting a schema with Parquet diff --git a/_data-prepper/pipelines/configuration/sources/documentdb.md b/_data-prepper/pipelines/configuration/sources/documentdb.md index c453b60a39d..d3dd31edcbe 100644 --- a/_data-prepper/pipelines/configuration/sources/documentdb.md +++ b/_data-prepper/pipelines/configuration/sources/documentdb.md @@ -3,7 +3,7 @@ layout: default title: documentdb parent: Sources grand_parent: Pipelines -nav_order: 2 +nav_order: 10 --- # documentdb diff --git a/_data-prepper/pipelines/configuration/sources/dynamo-db.md b/_data-prepper/pipelines/configuration/sources/dynamo-db.md index e465f450446..c5a7c8d1887 100644 --- a/_data-prepper/pipelines/configuration/sources/dynamo-db.md +++ b/_data-prepper/pipelines/configuration/sources/dynamo-db.md @@ -3,7 +3,7 @@ layout: default title: dynamodb parent: Sources grand_parent: Pipelines -nav_order: 3 +nav_order: 20 --- # dynamodb diff --git a/_data-prepper/pipelines/configuration/sources/http.md b/_data-prepper/pipelines/configuration/sources/http.md index 06933edc1c4..574f49e2894 100644 --- a/_data-prepper/pipelines/configuration/sources/http.md +++ b/_data-prepper/pipelines/configuration/sources/http.md @@ -3,14 +3,14 @@ layout: default title: http parent: Sources grand_parent: Pipelines -nav_order: 5 +nav_order: 30 redirect_from: - /data-prepper/pipelines/configuration/sources/http-source/ --- # http -The `http` plugin accepts HTTP requests from clients. Currently, `http` only supports the JSON UTF-8 codec for incoming requests, such as `[{"key1": "value1"}, {"key2": "value2"}]`. The following table describes options you can use to configure the `http` source. +The `http` plugin accepts HTTP requests from clients. The following table describes options you can use to configure the `http` source. Option | Required | Type | Description :--- | :--- | :--- | :--- @@ -36,6 +36,19 @@ aws_region | Conditionally | String | AWS region used by ACM or Amazon S3. Requi Content will be added to this section.---> +## Ingestion + +Clients should send HTTP `POST` requests to the endpoint `/log/ingest`. + +The `http` protocol only supports the JSON UTF-8 codec for incoming requests, for example, `[{"key1": "value1"}, {"key2": "value2"}]`. + +#### Example: Ingest data with cURL + +The following cURL command can be used to ingest data: + +`curl "http://localhost:2021/log/ingest" --data '[{"key1": "value1"}, {"key2": "value2"}]'` +{% include copy-curl.html %} + ## Metrics The `http` source includes the following metrics. diff --git a/_data-prepper/pipelines/configuration/sources/kafka.md b/_data-prepper/pipelines/configuration/sources/kafka.md index e8452a93c3b..ecd7c7eaa0a 100644 --- a/_data-prepper/pipelines/configuration/sources/kafka.md +++ b/_data-prepper/pipelines/configuration/sources/kafka.md @@ -3,7 +3,7 @@ layout: default title: kafka parent: Sources grand_parent: Pipelines -nav_order: 6 +nav_order: 40 --- # kafka diff --git a/_data-prepper/pipelines/configuration/sources/opensearch.md b/_data-prepper/pipelines/configuration/sources/opensearch.md index a7ba9657297..1ee22375753 100644 --- a/_data-prepper/pipelines/configuration/sources/opensearch.md +++ b/_data-prepper/pipelines/configuration/sources/opensearch.md @@ -3,7 +3,7 @@ layout: default title: opensearch parent: Sources grand_parent: Pipelines -nav_order: 30 +nav_order: 50 --- # opensearch diff --git a/_data-prepper/pipelines/configuration/sources/otel-logs-source.md b/_data-prepper/pipelines/configuration/sources/otel-logs-source.md index 068369efaf9..38095d7d7f7 100644 --- a/_data-prepper/pipelines/configuration/sources/otel-logs-source.md +++ b/_data-prepper/pipelines/configuration/sources/otel-logs-source.md @@ -3,7 +3,7 @@ layout: default title: otel_logs_source parent: Sources grand_parent: Pipelines -nav_order: 25 +nav_order: 60 --- # otel_logs_source diff --git a/_data-prepper/pipelines/configuration/sources/otel-metrics-source.md b/_data-prepper/pipelines/configuration/sources/otel-metrics-source.md index bea74a96d33..0e8d3778280 100644 --- a/_data-prepper/pipelines/configuration/sources/otel-metrics-source.md +++ b/_data-prepper/pipelines/configuration/sources/otel-metrics-source.md @@ -3,7 +3,7 @@ layout: default title: otel_metrics_source parent: Sources grand_parent: Pipelines -nav_order: 10 +nav_order: 70 --- # otel_metrics_source diff --git a/_data-prepper/pipelines/configuration/sources/otel-trace-source.md b/_data-prepper/pipelines/configuration/sources/otel-trace-source.md index 1be7864c33b..de45a5de638 100644 --- a/_data-prepper/pipelines/configuration/sources/otel-trace-source.md +++ b/_data-prepper/pipelines/configuration/sources/otel-trace-source.md @@ -3,7 +3,7 @@ layout: default title: otel_trace_source parent: Sources grand_parent: Pipelines -nav_order: 15 +nav_order: 80 redirect_from: - /data-prepper/pipelines/configuration/sources/otel-trace/ --- diff --git a/_data-prepper/pipelines/configuration/sources/pipeline.md b/_data-prepper/pipelines/configuration/sources/pipeline.md new file mode 100644 index 00000000000..6ba025bd18c --- /dev/null +++ b/_data-prepper/pipelines/configuration/sources/pipeline.md @@ -0,0 +1,31 @@ +--- +layout: default +title: pipeline +parent: Sources +grand_parent: Pipelines +nav_order: 90 +--- + +# pipeline + +Use the `pipeline` sink to read from another pipeline. + +## Configuration options + +The `pipeline` sink supports the following configuration options. + +| Option | Required | Type | Description | +|:-------|:---------|:-------|:---------------------------------------| +| `name` | Yes | String | The name of the pipeline to read from. | + +## Usage + +The following example configures a `pipeline` sink named `sample-pipeline` that reads from a pipeline named `movies`: + +```yaml +sample-pipeline: + source: + - pipeline: + name: "movies" +``` +{% include copy.html %} diff --git a/_data-prepper/pipelines/configuration/sources/s3.md b/_data-prepper/pipelines/configuration/sources/s3.md index 5a7d9986e5b..db92718a36f 100644 --- a/_data-prepper/pipelines/configuration/sources/s3.md +++ b/_data-prepper/pipelines/configuration/sources/s3.md @@ -3,7 +3,7 @@ layout: default title: s3 source parent: Sources grand_parent: Pipelines -nav_order: 20 +nav_order: 100 --- # s3 source diff --git a/_data-prepper/pipelines/configuration/sources/sources.md b/_data-prepper/pipelines/configuration/sources/sources.md index 811b161e169..682f215517a 100644 --- a/_data-prepper/pipelines/configuration/sources/sources.md +++ b/_data-prepper/pipelines/configuration/sources/sources.md @@ -3,7 +3,7 @@ layout: default title: Sources parent: Pipelines has_children: true -nav_order: 20 +nav_order: 110 --- # Sources diff --git a/_data/versions.json b/_data/versions.json index 4f7e55c21b8..c14e91fa0cf 100644 --- a/_data/versions.json +++ b/_data/versions.json @@ -1,10 +1,11 @@ { - "current": "2.16", + "current": "2.17", "all": [ - "2.16", + "2.17", "1.3" ], "archived": [ + "2.16", "2.15", "2.14", "2.13", @@ -25,7 +26,7 @@ "1.1", "1.0" ], - "latest": "2.16" + "latest": "2.17" } diff --git a/_field-types/index.md b/_field-types/index.md index 7a7e816adab..e9250f409d7 100644 --- a/_field-types/index.md +++ b/_field-types/index.md @@ -12,43 +12,77 @@ redirect_from: # Mappings and field types -You can define how documents and their fields are stored and indexed by creating a _mapping_. The mapping specifies the list of fields for a document. Every field in the document has a _field type_, which defines the type of data the field contains. For example, you may want to specify that the `year` field should be of type `date`. To learn more, see [Supported field types]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/index/). +Mappings tell OpenSearch how to store and index your documents and their fields. You can specify the data type for each field (for example, `year` as `date`) to make storage and querying more efficient. -If you're just starting to build out your cluster and data, you may not know exactly how your data should be stored. In those cases, you can use dynamic mappings, which tell OpenSearch to dynamically add data and its fields. However, if you know exactly what types your data falls under and want to enforce that standard, then you can use explicit mappings. +While [dynamic mappings](#dynamic-mapping) automatically add new data and fields, using explicit mappings is recommended. Explicit mappings let you define the exact structure and data types upfront. This helps to maintain data consistency and optimize performance, especially for large datasets or high-volume indexing operations. -For example, if you want to indicate that `year` should be of type `text` instead of an `integer`, and `age` should be an `integer`, you can do so with explicit mappings. By using dynamic mapping, OpenSearch might interpret both `year` and `age` as integers. +For example, with explicit mappings, you can ensure that `year` is treated as text and `age` as an integer instead of both being interpreted as integers by dynamic mapping. -This section provides an example for how to create an index mapping and how to add a document to it that will get ip_range validated. - -#### Table of contents -1. TOC -{:toc} - - ---- ## Dynamic mapping When you index a document, OpenSearch adds fields automatically with dynamic mapping. You can also explicitly add fields to an index mapping. -#### Dynamic mapping types +### Dynamic mapping types Type | Description :--- | :--- -null | A `null` field can't be indexed or searched. When a field is set to null, OpenSearch behaves as if that field has no values. -boolean | OpenSearch accepts `true` and `false` as boolean values. An empty string is equal to `false.` -float | A single-precision 32-bit floating point number. -double | A double-precision 64-bit floating point number. -integer | A signed 32-bit number. -object | Objects are standard JSON objects, which can have fields and mappings of their own. For example, a `movies` object can have additional properties such as `title`, `year`, and `director`. -array | Arrays in OpenSearch can only store values of one type, such as an array of just integers or strings. Empty arrays are treated as though they are fields with no values. -text | A string sequence of characters that represent full-text values. -keyword | A string sequence of structured characters, such as an email address or ZIP code. +`null` | A `null` field can't be indexed or searched. When a field is set to null, OpenSearch behaves as if the field has no value. +`boolean` | OpenSearch accepts `true` and `false` as Boolean values. An empty string is equal to `false.` +`float` | A single-precision, 32-bit floating-point number. +`double` | A double-precision, 64-bit floating-point number. +`integer` | A signed 32-bit number. +`object` | Objects are standard JSON objects, which can have fields and mappings of their own. For example, a `movies` object can have additional properties such as `title`, `year`, and `director`. +`array` | OpenSearch does not have a specific array data type. Arrays are represented as a set of values of the same data type (for example, integers or strings) associated with a field. When indexing, you can pass multiple values for a field, and OpenSearch will treat it as an array. Empty arrays are valid and recognized as array fields with zero elements---not as fields with no values. OpenSearch supports querying and filtering arrays, including checking for values, range queries, and array operations like concatenation and intersection. Nested arrays, which may contain complex objects or other arrays, can also be used for advanced data modeling. +`text` | A string sequence of characters that represent full-text values. +`keyword` | A string sequence of structured characters, such as an email address or ZIP code. date detection string | Enabled by default, if new string fields match a date's format, then the string is processed as a `date` field. For example, `date: "2012/03/11"` is processed as a date. numeric detection string | If disabled, OpenSearch may automatically process numeric values as strings when they should be processed as numbers. When enabled, OpenSearch can process strings into `long`, `integer`, `short`, `byte`, `double`, `float`, `half_float`, `scaled_float`, and `unsigned_long`. Default is disabled. +### Dynamic templates + +Dynamic templates are used to define custom mappings for dynamically added fields based on the data type, field name, or field path. They allow you to define a flexible schema for your data that can automatically adapt to changes in the structure or format of the input data. + +You can use the following syntax to define a dynamic mapping template: + +```json +PUT index +{ + "mappings": { + "dynamic_templates": [ + { + "fields": { + "mapping": { + "type": "short" + }, + "match_mapping_type": "string", + "path_match": "status*" + } + } + ] + } +} +``` +{% include copy-curl.html %} + +This mapping configuration dynamically maps any field with a name starting with `status` (for example, `status_code`) to the `short` data type if the initial value provided during indexing is a string. + +### Dynamic mapping parameters + +The `dynamic_templates` support the following parameters for matching conditions and mapping rules. The default value is `null`. + +Parameter | Description | +----------|-------------| +`match_mapping_type` | Specifies the JSON data type (for example, string, long, double, object, binary, Boolean, date) that triggers the mapping. +`match` | A regular expression used to match field names and apply the mapping. +`unmatch` | A regular expression used to exclude field names from the mapping. +`match_pattern` | Determines the pattern matching behavior, either `regex` or `simple`. Default is `simple`. +`path_match` | Allows you to match nested field paths using a regular expression. +`path_unmatch` | Excludes nested field paths from the mapping using a regular expression. +`mapping` | The mapping configuration to apply. + ## Explicit mapping -If you know exactly what your field data types need to be, you can specify them in your request body when creating your index. +If you know exactly which field data types you need to use, then you can specify them in your request body when creating your index, as shown in the following example request: ```json PUT sample-index1 @@ -62,8 +96,9 @@ PUT sample-index1 } } ``` +{% include copy-curl.html %} -### Response +#### Response ```json { "acknowledged": true, @@ -71,8 +106,9 @@ PUT sample-index1 "index": "sample-index1" } ``` +{% include copy-curl.html %} -To add mappings to an existing index or data stream, you can send a request to the `_mapping` endpoint using the `PUT` or `POST` HTTP method: +To add mappings to an existing index or data stream, you can send a request to the `_mapping` endpoint using the `PUT` or `POST` HTTP method, as shown in the following example request: ```json POST sample-index1/_mapping @@ -84,84 +120,29 @@ POST sample-index1/_mapping } } ``` +{% include copy-curl.html %} You cannot change the mapping of an existing field, you can only modify the field's mapping parameters. {: .note} ---- -## Mapping example usage +## Mapping parameters -The following example shows how to create a mapping to specify that OpenSearch should ignore any documents with malformed IP addresses that do not conform to the [`ip`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/ip/) data type. You accomplish this by setting the `ignore_malformed` parameter to `true`. +Mapping parameters are used to configure the behavior of index fields. See [Mappings and field types]({{site.url}}{{site.baseurl}}/field-types/) for more information. -### Create an index with an `ip` mapping +## Mapping limit settings -To create an index, use a PUT request: +OpenSearch has certain mapping limits and settings, such as the settings listed in the following table. Settings can be configured based on your requirements. -```json -PUT /test-index -{ - "mappings" : { - "properties" : { - "ip_address" : { - "type" : "ip", - "ignore_malformed": true - } - } - } -} -``` - -You can add a document that has a malformed IP address to your index: - -```json -PUT /test-index/_doc/1 -{ - "ip_address" : "malformed ip address" -} -``` - -This indexed IP address does not throw an error because `ignore_malformed` is set to true. - -You can query the index using the following request: - -```json -GET /test-index/_search -``` +| Setting | Default value | Allowed value | Type | Description | +|-|-|-|-|-| +| `index.mapping.nested_fields.limit` | 50 | [0,) | Dynamic | Limits the maximum number of nested fields that can be defined in an index mapping. | +| `index.mapping.nested_objects.limit` | 10,000 | [0,) | Dynamic | Limits the maximum number of nested objects that can be created in a single document. | +| `index.mapping.total_fields.limit` | 1,000 | [0,) | Dynamic | Limits the maximum number of fields that can be defined in an index mapping. | +| `index.mapping.depth.limit` | 20 | [1,100] | Dynamic | Limits the maximum depth of nested objects and nested fields that can be defined in an index mapping. | +| `index.mapping.field_name_length.limit` | 50,000 | [1,50000] | Dynamic | Limits the maximum length of field names that can be defined in an index mapping. | +| `index.mapper.dynamic` | true | {true,false} | Dynamic | Determines whether new fields should be dynamically added to a mapping. | -The response shows that the `ip_address` field is ignored in the indexed document: - -```json -{ - "took": 14, - "timed_out": false, - "_shards": { - "total": 1, - "successful": 1, - "skipped": 0, - "failed": 0 - }, - "hits": { - "total": { - "value": 1, - "relation": "eq" - }, - "max_score": 1, - "hits": [ - { - "_index": "test-index", - "_id": "1", - "_score": 1, - "_ignored": [ - "ip_address" - ], - "_source": { - "ip_address": "malformed ip address" - } - } - ] - } -} -``` +--- ## Get a mapping @@ -170,14 +151,16 @@ To get all mappings for one or more indexes, use the following request: ```json GET /_mapping ``` +{% include copy-curl.html %} -In the above request, `` may be an index name or a comma-separated list of index names. +In the previous request, `` may be an index name or a comma-separated list of index names. To get all mappings for all indexes, use the following request: ```json GET _mapping ``` +{% include copy-curl.html %} To get a mapping for a specific field, provide the index name and the field name: @@ -185,14 +168,14 @@ To get a mapping for a specific field, provide the index name and the field name GET _mapping/field/ GET //_mapping/field/ ``` +{% include copy-curl.html %} -Both `` and `` can be specified as one value or a comma-separated list. - -For example, the following request retrieves the mapping for the `year` and `age` fields in `sample-index1`: +Both `` and `` can be specified as either one value or a comma-separated list. For example, the following request retrieves the mapping for the `year` and `age` fields in `sample-index1`: ```json GET sample-index1/_mapping/field/year,age ``` +{% include copy-curl.html %} The response contains the specified fields: @@ -220,3 +203,8 @@ The response contains the specified fields: } } ``` +{% include copy-curl.html %} + +## Mappings use cases + +See [Mappings use cases]({{site.url}}{{site.baseurl}}/field-types/mappings-use-cases/) for use case examples, including examples of mapping string fields and ignoring malformed IP addresses. diff --git a/_field-types/mappings-use-cases.md b/_field-types/mappings-use-cases.md new file mode 100644 index 00000000000..835e030bab5 --- /dev/null +++ b/_field-types/mappings-use-cases.md @@ -0,0 +1,122 @@ +--- +layout: default +title: Mappings use cases +parent: Mappings and fields types +nav_order: 5 +nav_exclude: true +--- + +# Mappings use cases + +Mappings provide control over how data is indexed and queried, enabling optimized performance and efficient storage for a range of use cases. + +--- + +## Example: Ignoring malformed IP addresses + +The following example shows you how to create a mapping specifying that OpenSearch should ignore any documents containing malformed IP addresses that do not conform to the [`ip`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/ip/) data type. You can accomplish this by setting the `ignore_malformed` parameter to `true`. + +### Create an index with an `ip` mapping + +To create an index with an `ip` mapping, use a PUT request: + +```json +PUT /test-index +{ + "mappings" : { + "properties" : { + "ip_address" : { + "type" : "ip", + "ignore_malformed": true + } + } + } +} +``` +{% include copy-curl.html %} + +Then add a document with a malformed IP address: + +```json +PUT /test-index/_doc/1 +{ + "ip_address" : "malformed ip address" +} +``` +{% include copy-curl.html %} + +When you query the index, the `ip_address` field will be ignored. You can query the index using the following request: + +```json +GET /test-index/_search +``` +{% include copy-curl.html %} + +#### Response + +```json +{ + "took": 14, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "test-index", + "_id": "1", + "_score": 1, + "_ignored": [ + "ip_address" + ], + "_source": { + "ip_address": "malformed ip address" + } + } + ] + } +} +``` +{% include copy-curl.html %} + +--- + +## Mapping string fields to `text` and `keyword` types + +To create an index named `movies1` with a dynamic template that maps all string fields to both the `text` and `keyword` types, you can use the following request: + +```json +PUT movies1 +{ + "mappings": { + "dynamic_templates": [ + { + "strings": { + "match_mapping_type": "string", + "mapping": { + "type": "text", + "fields": { + "keyword": { + "type": "keyword", + "ignore_above": 256 + } + } + } + } + } + ] + } +} +``` +{% include copy-curl.html %} + +This dynamic template ensures that any string fields in your documents will be indexed as both a full-text `text` type and a `keyword` type. diff --git a/_field-types/metadata-fields/field-names.md b/_field-types/metadata-fields/field-names.md new file mode 100644 index 00000000000..b17e94fbb44 --- /dev/null +++ b/_field-types/metadata-fields/field-names.md @@ -0,0 +1,43 @@ +--- +layout: default +title: Field names +nav_order: 10 +parent: Metadata fields +--- + +# Field names + +The `_field_names` field indexes field names that contain non-null values. This enables the use of the `exists` query, which can identify documents that either have or do not have non-null values for a specified field. + +However, `_field_names` only indexes field names when both `doc_values` and `norms` are disabled. If either `doc_values` or `norms` are enabled, then the `exists` query still functions but will not rely on the `_field_names` field. + +## Mapping example + +```json +{ + "mappings": { + "_field_names": { + "enabled": "true" + }, + "properties": { + }, + "title": { + "type": "text", + "doc_values": false, + "norms": false + }, + "description": { + "type": "text", + "doc_values": true, + "norms": false + }, + "price": { + "type": "float", + "doc_values": false, + "norms": true + } + } + } +} +``` +{% include copy-curl.html %} diff --git a/_field-types/metadata-fields/id.md b/_field-types/metadata-fields/id.md new file mode 100644 index 00000000000..f66f4b8e133 --- /dev/null +++ b/_field-types/metadata-fields/id.md @@ -0,0 +1,86 @@ +--- +layout: default +title: ID +nav_order: 20 +parent: Metadata fields +--- + +# ID + +Each document in OpenSearch has a unique `_id` field. This field is indexed, allowing you to retrieve documents using the GET API or the [`ids` query]({{site.url}}{{site.baseurl}}/query-dsl/term/ids/). + +If you do not provide an `_id` value, then OpenSearch automatically generates one for the document. +{: .note} + +The following example request creates an index named `test-index1` and adds two documents with different `_id` values: + +```json +PUT test-index1/_doc/1 +{ + "text": "Document with ID 1" +} + +PUT test-index1/_doc/2?refresh=true +{ + "text": "Document with ID 2" +} +``` +{% include copy-curl.html %} + +You can then query the documents using the `_id` field, as shown in the following example request: + +```json +GET test-index1/_search +{ + "query": { + "terms": { + "_id": ["1", "2"] + } + } +} +``` +{% include copy-curl.html %} + +The response returns both documents with `_id` values of `1` and `2`: + +```json +{ + "took": 10, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "test-index1", + "_id": "1", + "_score": 1, + "_source": { + "text": "Document with ID 1" + } + }, + { + "_index": "test-index1", + "_id": "2", + "_score": 1, + "_source": { + "text": "Document with ID 2" + } + } + ] + } +``` +{% include copy-curl.html %} + +## Limitations of the `_id` field + +While the `_id` field can be used in various queries, it is restricted from use in aggregations, sorting, and scripting. If you need to sort or aggregate on the `_id` field, it is recommended to duplicate the `_id` content into another field with `doc_values` enabled. Refer to [IDs query]({{site.url}}{{site.baseurl}}/query-dsl/term/ids/) for an example. diff --git a/_field-types/metadata-fields/ignored.md b/_field-types/metadata-fields/ignored.md new file mode 100644 index 00000000000..e867cfc754f --- /dev/null +++ b/_field-types/metadata-fields/ignored.md @@ -0,0 +1,147 @@ +--- +layout: default +title: Ignored +nav_order: 25 +parent: Metadata fields +--- + +# Ignored + +The `_ignored` field helps you manage issues related to malformed data in your documents. This field is used to index and store field names that were ignored during the indexing process as a result of the `ignore_malformed` setting being enabled in the [index mapping]({{site.url}}{{site.baseurl}}/field-types/). + +The `_ignored` field allows you to search for and identify documents containing fields that were ignored as well as for the specific field names that were ignored. This can be useful for troubleshooting. + +You can query the `_ignored` field using the `term`, `terms`, and `exists` queries, and the results will be included in the search hits. + +The `_ignored` field is only populated when the `ignore_malformed` setting is enabled in your index mapping. If `ignore_malformed` is set to `false` (the default value), then malformed fields will cause the entire document to be rejected, and the `_ignored` field will not be populated. +{: .note} + +The following example request shows you how to use the `_ignored` field: + +```json +GET _search +{ + "query": { + "exists": { + "field": "_ignored" + } + } +} +``` +{% include copy-curl.html %} + +--- + +#### Example indexing request with the `_ignored` field + +The following example request adds a new document to the `test-ignored` index with `ignore_malformed` set to `true` so that no error is thrown during indexing: + +```json +PUT test-ignored +{ + "mappings": { + "properties": { + "title": { + "type": "text" + }, + "length": { + "type": "long", + "ignore_malformed": true + } + } + } +} + +POST test-ignored/_doc +{ + "title": "correct text", + "length": "not a number" +} + +GET test-ignored/_search +{ + "query": { + "exists": { + "field": "_ignored" + } + } +} +``` +{% include copy-curl.html %} + +#### Example reponse + +```json +{ + "took": 42, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "test-ignored", + "_id": "qcf0wZABpEYH7Rw9OT7F", + "_score": 1, + "_ignored": [ + "length" + ], + "_source": { + "title": "correct text", + "length": "not a number" + } + } + ] + } +} +``` + +--- + +## Ignoring a specified field + +You can use a `term` query to find documents in which a specific field was ignored, as shown in the following example request: + +```json +GET _search +{ + "query": { + "term": { + "_ignored": "created_at" + } + } +} +``` +{% include copy-curl.html %} + +#### Reponse + +```json +{ + "took": 51, + "timed_out": false, + "_shards": { + "total": 45, + "successful": 45, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 0, + "relation": "eq" + }, + "max_score": null, + "hits": [] + } +} +``` diff --git a/_field-types/metadata-fields/index-metadata.md b/_field-types/metadata-fields/index-metadata.md new file mode 100644 index 00000000000..657f7d62a59 --- /dev/null +++ b/_field-types/metadata-fields/index-metadata.md @@ -0,0 +1,86 @@ +--- +layout: default +title: Index +nav_order: 25 +parent: Metadata fields +--- + +# Index + +When querying across multiple indexes, you may need to filter results based on the index into which a document was indexed. The `index` field matches documents based on their index. + +The following example request creates two indexes, `products` and `customers`, and adds a document to each index: + +```json +PUT products/_doc/1 +{ + "name": "Widget X" +} + +PUT customers/_doc/2 +{ + "name": "John Doe" +} +``` +{% include copy-curl.html %} + +You can then query both indexes and filter the results using the `_index` field, as shown in the following example request: + +```json +GET products,customers/_search +{ + "query": { + "terms": { + "_index": ["products", "customers"] + } + }, + "aggs": { + "index_groups": { + "terms": { + "field": "_index", + "size": 10 + } + } + }, + "sort": [ + { + "_index": { + "order": "desc" + } + } + ], + "script_fields": { + "index_name": { + "script": { + "lang": "painless", + "source": "doc['_index'].value" + } + } + } +} +``` +{% include copy-curl.html %} + +In this example: + +- The `query` section uses a `terms` query to match documents from the `products` and `customers` indexes. +- The `aggs` section performs a `terms` aggregation on the `_index` field, grouping the results by index. +- The `sort` section sorts the results by the `_index` field in ascending order. +- The `script_fields` section adds a new field called `index_name` to the search results containing the `_index` field value for each document. + +## Querying on the `_index` field + +The `_index` field represents the index into which a document was indexed. You can use this field in your queries to filter, aggregate, sort, or retrieve index information for your search results. + +Because the `_index` field is automatically added to every document, you can use it in your queries like any other field. For example, you can use the `terms` query to match documents from multiple indexes. The following example query returns all documents from the `products` and `customers` indexes: + +```json + { + "query": { + "terms": { + "_index": ["products", "customers"] + } + } +} +``` +{% include copy-curl.html %} diff --git a/_field-types/metadata-fields/index.md b/_field-types/metadata-fields/index.md new file mode 100644 index 00000000000..cdc079e1e5a --- /dev/null +++ b/_field-types/metadata-fields/index.md @@ -0,0 +1,21 @@ +--- +layout: default +title: Metadata fields +nav_order: 90 +has_children: true +has_toc: false +--- + +# Metadata fields + +OpenSearch provides built-in metadata fields that allow you to access information about the documents in an index. These fields can be used in your queries as needed. + +Metadata field | Description +:--- | :--- +`_field_names` | The document fields with non-empty or non-null values. +`_ignored` | The document fields that were ignored during the indexing process due to the presence of malformed data, as specified by the `ignore_malformed` setting. +`_id` | The unique identifier assigned to each document. +`_index` | The index in which the document is stored. +`_meta` | Stores custom metadata or additional information specific to the application or use case. +`_routing` | Allows you to specify a custom value that determines the shard assignment for a document in an OpenSearch cluster. +`_source` | Contains the original JSON representation of the document data. diff --git a/_field-types/metadata-fields/meta.md b/_field-types/metadata-fields/meta.md new file mode 100644 index 00000000000..220d58f1068 --- /dev/null +++ b/_field-types/metadata-fields/meta.md @@ -0,0 +1,87 @@ +--- +layout: default +title: Meta +nav_order: 30 +parent: Metadata fields +--- + +# Meta + +The `_meta` field is a mapping property that allows you to attach custom metadata to your index mappings. This metadata can be used by your application to store information relevant to your use case, such as versioning, ownership, categorization, or auditing. + +## Usage + +You can define the `_meta` field when creating a new index or updating an existing index's mapping, as shown in the following example request: + +```json +PUT my-index +{ + "mappings": { + "_meta": { + "application": "MyApp", + "version": "1.2.3", + "author": "John Doe" + }, + "properties": { + "title": { + "type": "text" + }, + "description": { + "type": "text" + } + } + } +} + +``` +{% include copy-curl.html %} + +In this example, three custom metadata fields are added: `application`, `version`, and `author`. These fields can be used by your application to store any relevant information about the index, such as the application it belongs to, the application version, or the author of the index. + +You can update the `_meta` field using the [Put Mapping API]({{site.url}}{{site.baseurl}}/api-reference/index-apis/put-mapping/) operation, as shown in the following example request: + +```json +PUT my-index/_mapping +{ + "_meta": { + "application": "MyApp", + "version": "1.3.0", + "author": "Jane Smith" + } +} +``` +{% include copy-curl.html %} + +## Retrieving `meta` information + +You can retrieve the `_meta` information for an index using the [Get Mapping API]({{site.url}}{{site.baseurl}}/field-types/#get-a-mapping) operation, as shown in the following example request: + +```json +GET my-index/_mapping +``` +{% include copy-curl.html %} + +The response returns the full index mapping, including the `_meta` field: + +```json +{ + "my-index": { + "mappings": { + "_meta": { + "application": "MyApp", + "version": "1.3.0", + "author": "Jane Smith" + }, + "properties": { + "description": { + "type": "text" + }, + "title": { + "type": "text" + } + } + } + } +} +``` +{% include copy-curl.html %} diff --git a/_field-types/metadata-fields/routing.md b/_field-types/metadata-fields/routing.md new file mode 100644 index 00000000000..9064e20c497 --- /dev/null +++ b/_field-types/metadata-fields/routing.md @@ -0,0 +1,92 @@ +--- +layout: default +title: Routing +nav_order: 35 +parent: Metadata fields +--- + +# Routing + +OpenSearch uses a hashing algorithm to route documents to specific shards in an index. By default, the document's `_id` field is used as the routing value, but you can also specify a custom routing value for each document. + +## Default routing + +The following is the default OpenSearch routing formula. The `_routing` value is the document's `_id`. + +```json +shard_num = hash(_routing) % num_primary_shards +``` + +## Custom routing + +You can specify a custom routing value when indexing a document, as shown in the following example request: + +```json +PUT sample-index1/_doc/1?routing=JohnDoe1 +{ + "title": "This is a document" +} +``` +{% include copy-curl.html %} + +In this example, the document is routed using the value `JohnDoe1` instead of the default `_id`. + +You must provide the same routing value when retrieving, deleting, or updating the document, as shown in the following example request: + +```json +GET sample-index1/_doc/1?routing=JohnDoe1 +``` +{% include copy-curl.html %} + +## Querying by routing + +You can query documents based on their routing value by using the `_routing` field, as shown in the following example. This query only searches the shard(s) associated with the `JohnDoe1` routing value: + +```json +GET sample-index1/_search +{ + "query": { + "terms": { + "_routing": [ "JohnDoe1" ] + } + } +} +``` +{% include copy-curl.html %} + +## Required routing + +You can make custom routing required for all CRUD operations on an index, as shown in the following example request. If you try to index a document without providing a routing value, OpenSearch will throw an exception. + +```json +PUT sample-index2 +{ + "mappings": { + "_routing": { + "required": true + } + } +} +``` +{% include copy-curl.html %} + +## Routing to specific shards + +You can configure an index to route custom values to a subset of shards rather than a single shard. This is done by setting `index.routing_partition_size` at the time of index creation. The formula for calculating the shard is `shard_num = (hash(_routing) + hash(_id)) % routing_partition_size) % num_primary_shards`. + +The following example request routes documents to one of four shards in the index: + +```json +PUT sample-index3 +{ + "settings": { + "index.routing_partition_size": 4 + }, + "mappings": { + "_routing": { + "required": true + } + } +} +``` +{% include copy-curl.html %} diff --git a/_field-types/metadata-fields/source.md b/_field-types/metadata-fields/source.md new file mode 100644 index 00000000000..c9e714f43ce --- /dev/null +++ b/_field-types/metadata-fields/source.md @@ -0,0 +1,54 @@ +--- +layout: default +title: Source +nav_order: 40 +parent: Metadata fields +--- + +# Source + +The `_source` field contains the original JSON document body that was indexed. While this field is not searchable, it is stored so that the full document can be returned when executing fetch requests, such as `get` and `search`. + +## Disabling the field + +You can disable the `_source` field by setting the `enabled` parameter to `false`, as shown in the following example request: + +```json +PUT sample-index1 +{ + "mappings": { + "_source": { + "enabled": false + } + } +} +``` +{% include copy-curl.html %} + +Disabling the `_source` field can impact the availability of certain features, such as the `update`, `update_by_query`, and `reindex` APIs, as well as the ability to debug queries or aggregations using the original indexed document. +{: .warning} + +## Including or excluding fields + +You can selectively control the contents of the `_source` field by using the `includes` and `excludes` parameters. This allows you to prune the stored `_source` field after it is indexed but before it is saved, as shown in the following example request: + +```json +PUT logs +{ + "mappings": { + "_source": { + "includes": [ + "*.count", + "meta.*" + ], + "excludes": [ + "meta.description", + "meta.other.*" + ] + } + } +} +``` +{% include copy-curl.html %} + +These fields are not stored in the `_source`, but you can still search them because the data remains indexed. diff --git a/_field-types/supported-field-types/alias.md b/_field-types/supported-field-types/alias.md index 29cc58885c9..f1f6ae9ac8f 100644 --- a/_field-types/supported-field-types/alias.md +++ b/_field-types/supported-field-types/alias.md @@ -10,6 +10,8 @@ redirect_from: --- # Alias field type +**Introduced 1.0** +{: .label .label-purple } An alias field type creates another name for an existing field. You can use aliases in the[search](#using-aliases-in-search-api-operations) and [field capabilities](#using-aliases-in-field-capabilities-api-operations) API operations, with some [exceptions](#exceptions). To set up an [alias](#alias-field), you need to specify the [original field](#original-field) name in the `path` parameter. diff --git a/_field-types/supported-field-types/binary.md b/_field-types/supported-field-types/binary.md index d6974ad4cf6..bb257bf7ec7 100644 --- a/_field-types/supported-field-types/binary.md +++ b/_field-types/supported-field-types/binary.md @@ -10,6 +10,8 @@ redirect_from: --- # Binary field type +**Introduced 1.0** +{: .label .label-purple } A binary field type contains a binary value in [Base64](https://en.wikipedia.org/wiki/Base64) encoding that is not searchable. @@ -50,5 +52,5 @@ The following table lists the parameters accepted by binary field types. All par Parameter | Description :--- | :--- -`doc_values` | A Boolean value that specifies whether the field should be stored on disk so that it can be used for aggregations, sorting, or scripting. Optional. Default is `true`. -`store` | A Boolean value that specifies whether the field value should be stored and can be retrieved separately from the _source field. Optional. Default is `false`. \ No newline at end of file +`doc_values` | A Boolean value that specifies whether the field should be stored on disk so that it can be used for aggregations, sorting, or scripting. Optional. Default is `false`. +`store` | A Boolean value that specifies whether the field value should be stored and can be retrieved separately from the _source field. Optional. Default is `false`. diff --git a/_field-types/supported-field-types/boolean.md b/_field-types/supported-field-types/boolean.md index 8233a45ad52..82cfdecf475 100644 --- a/_field-types/supported-field-types/boolean.md +++ b/_field-types/supported-field-types/boolean.md @@ -10,6 +10,8 @@ redirect_from: --- # Boolean field type +**Introduced 1.0** +{: .label .label-purple } A Boolean field type takes `true` or `false` values, or `"true"` or `"false"` strings. You can also pass an empty string (`""`) in place of a `false` value. diff --git a/_field-types/supported-field-types/completion.md b/_field-types/supported-field-types/completion.md index 85c803baa1c..e6e392fb6d3 100644 --- a/_field-types/supported-field-types/completion.md +++ b/_field-types/supported-field-types/completion.md @@ -11,6 +11,8 @@ redirect_from: --- # Completion field type +**Introduced 1.0** +{: .label .label-purple } A completion field type provides autocomplete functionality through a completion suggester. The completion suggester is a prefix suggester, so it matches the beginning of text only. A completion suggester creates an in-memory data structure, which provides faster lookups but leads to increased memory usage. You need to upload a list of all possible completions into the index before using this feature. diff --git a/_field-types/supported-field-types/constant-keyword.md b/_field-types/supported-field-types/constant-keyword.md index bf1e4afc702..4f9261f1a13 100644 --- a/_field-types/supported-field-types/constant-keyword.md +++ b/_field-types/supported-field-types/constant-keyword.md @@ -8,6 +8,8 @@ grand_parent: Supported field types --- # Constant keyword field type +**Introduced 2.14** +{: .label .label-purple } A constant keyword field uses the same value for all documents in the index. diff --git a/_field-types/supported-field-types/date-nanos.md b/_field-types/supported-field-types/date-nanos.md index 12399a69d40..eb569265fcc 100644 --- a/_field-types/supported-field-types/date-nanos.md +++ b/_field-types/supported-field-types/date-nanos.md @@ -8,6 +8,8 @@ grand_parent: Supported field types --- # Date nanoseconds field type +**Introduced 1.0** +{: .label .label-purple } The `date_nanos` field type is similar to the [`date`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/date/) field type in that it holds a date. However, `date` stores the date in millisecond resolution, while `date_nanos` stores the date in nanosecond resolution. Dates are stored as `long` values that correspond to nanoseconds since the epoch. Therefore, the range of supported dates is approximately 1970--2262. diff --git a/_field-types/supported-field-types/date.md b/_field-types/supported-field-types/date.md index fb008d1512d..8ca986219b3 100644 --- a/_field-types/supported-field-types/date.md +++ b/_field-types/supported-field-types/date.md @@ -11,6 +11,8 @@ redirect_from: --- # Date field type +**Introduced 1.0** +{: .label .label-purple } A date in OpenSearch can be represented as one of the following: diff --git a/_field-types/supported-field-types/derived.md b/_field-types/supported-field-types/derived.md index d989c3e4a46..cb358f47f96 100644 --- a/_field-types/supported-field-types/derived.md +++ b/_field-types/supported-field-types/derived.md @@ -28,11 +28,11 @@ Despite the potential performance impact of query-time computations, the flexibi Currently, derived fields have the following limitations: -- **Aggregation, scoring, and sorting**: Not yet supported. +- **Scoring and sorting**: Not yet supported. +- **Aggregations**: Starting with OpenSearch 2.17, derived fields support most aggregation types. The following aggregations are not supported: geographic (geodistance, geohash grid, geohex grid, geotile grid, geobounds, geocentroid), significant terms, significant text, and scripted metric. - **Dashboard support**: These fields are not displayed in the list of available fields in OpenSearch Dashboards. However, you can still use them for filtering if you know the derived field name. - **Chained derived fields**: One derived field cannot be used to define another derived field. - **Join field type**: Derived fields are not supported for the [join field type]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/join/). -- **Concurrent segment search**: Derived fields are not supported for [concurrent segment search]({{site.url}}{{site.baseurl}}/search-plugins/concurrent-segment-search/). We are planning to address these limitations in future versions. @@ -541,6 +541,80 @@ The response specifies highlighting in the `url` field: ``` +## Aggregations + +Starting with OpenSearch 2.17, derived fields support most aggregation types. + +Geographic, significant terms, significant text, and scripted metric aggregations are not supported. +{: .note} + +For example, the following request creates a simple `terms` aggregation on the `method` derived field: + +```json +POST /logs/_search +{ + "size": 0, + "aggs": { + "methods": { + "terms": { + "field": "method" + } + } + } +} +``` +{% include copy-curl.html %} + +The response contains the following buckets: + +
+ + Response + + {: .text-delta} + +```json +{ + "took" : 12, + "timed_out" : false, + "_shards" : { + "total" : 1, + "successful" : 1, + "skipped" : 0, + "failed" : 0 + }, + "hits" : { + "total" : { + "value" : 5, + "relation" : "eq" + }, + "max_score" : null, + "hits" : [ ] + }, + "aggregations" : { + "methods" : { + "doc_count_error_upper_bound" : 0, + "sum_other_doc_count" : 0, + "buckets" : [ + { + "key" : "GET", + "doc_count" : 2 + }, + { + "key" : "POST", + "doc_count" : 2 + }, + { + "key" : "DELETE", + "doc_count" : 1 + } + ] + } + } +} +``` +
+ ## Performance Derived fields are not indexed but are computed dynamically by retrieving values from the `_source` field or doc values. Thus, they run more slowly. To improve performance, try the following: diff --git a/_field-types/supported-field-types/flat-object.md b/_field-types/supported-field-types/flat-object.md index 933c385ac57..c9e59710e15 100644 --- a/_field-types/supported-field-types/flat-object.md +++ b/_field-types/supported-field-types/flat-object.md @@ -10,6 +10,8 @@ redirect_from: --- # Flat object field type +**Introduced 2.7** +{: .label .label-purple } In OpenSearch, you don't have to specify a mapping before indexing documents. If you don't specify a mapping, OpenSearch uses [dynamic mapping]({{site.url}}{{site.baseurl}}/field-types/index#dynamic-mapping) to map every field and its subfields in the document automatically. When you ingest documents such as logs, you may not know every field's subfield name and type in advance. In this case, dynamically mapping all new subfields can quickly lead to a "mapping explosion," where the growing number of fields may degrade the performance of your cluster. diff --git a/_field-types/supported-field-types/geo-point.md b/_field-types/supported-field-types/geo-point.md index 0912dc618d0..96586d044fa 100644 --- a/_field-types/supported-field-types/geo-point.md +++ b/_field-types/supported-field-types/geo-point.md @@ -11,6 +11,8 @@ redirect_from: --- # Geopoint field type +**Introduced 1.0** +{: .label .label-purple } A geopoint field type contains a geographic point specified by latitude and longitude. diff --git a/_field-types/supported-field-types/geo-shape.md b/_field-types/supported-field-types/geo-shape.md index b7b06a0d042..ee98bfca03e 100644 --- a/_field-types/supported-field-types/geo-shape.md +++ b/_field-types/supported-field-types/geo-shape.md @@ -11,6 +11,8 @@ redirect_from: --- # Geoshape field type +**Introduced 1.0** +{: .label .label-purple } A geoshape field type contains a geographic shape, such as a polygon or a collection of geographic points. To index a geoshape, OpenSearch tesselates the shape into a triangular mesh and stores each triangle in a BKD tree. This provides a 10-7decimal degree of precision, which represents near-perfect spatial resolution. Performance of this process is mostly impacted by the number of vertices in a polygon you are indexing. diff --git a/_field-types/supported-field-types/ip.md b/_field-types/supported-field-types/ip.md index cb2a5569c83..99b41e45cdb 100644 --- a/_field-types/supported-field-types/ip.md +++ b/_field-types/supported-field-types/ip.md @@ -10,6 +10,8 @@ redirect_from: --- # IP address field type +**Introduced 1.0** +{: .label .label-purple } An ip field type contains an IP address in IPv4 or IPv6 format. diff --git a/_field-types/supported-field-types/join.md b/_field-types/supported-field-types/join.md index c83705f4c33..009471a784a 100644 --- a/_field-types/supported-field-types/join.md +++ b/_field-types/supported-field-types/join.md @@ -11,6 +11,8 @@ redirect_from: --- # Join field type +**Introduced 1.0** +{: .label .label-purple } A join field type establishes a parent/child relationship between documents in the same index. @@ -59,7 +61,7 @@ PUT testindex1/_doc/1 ``` {% include copy-curl.html %} -When indexing child documents, you have to specify the `routing` query parameter because parent and child documents in the same relation have to be indexed on the same shard. Each child document refers to its parent's ID in the `parent` field. +When indexing child documents, you need to specify the `routing` query parameter because parent and child documents in the same parent/child hierarchy must be indexed on the same shard. For more information, see [Routing]({{site.url}}{{site.baseurl}}/field-types/metadata-fields/routing/). Each child document refers to its parent's ID in the `parent` field. Index two child documents, one for each parent: @@ -325,3 +327,8 @@ PUT testindex1 - Multiple parents are not supported. - You can add a child document to an existing document only if the existing document is already marked as a parent. - You can add a new relation to an existing join field. + +## Next steps + +- Learn about [joining queries]({{site.url}}{{site.baseurl}}/query-dsl/joining/) on join fields. +- Learn more about [retrieving inner hits]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/inner-hits/). \ No newline at end of file diff --git a/_field-types/supported-field-types/keyword.md b/_field-types/supported-field-types/keyword.md index eea6cc664bc..ca9c8085f6f 100644 --- a/_field-types/supported-field-types/keyword.md +++ b/_field-types/supported-field-types/keyword.md @@ -11,6 +11,8 @@ redirect_from: --- # Keyword field type +**Introduced 1.0** +{: .label .label-purple } A keyword field type contains a string that is not analyzed. It allows only exact, case-sensitive matches. diff --git a/_field-types/supported-field-types/knn-vector.md b/_field-types/supported-field-types/knn-vector.md index a2a71377333..da784aeefe6 100644 --- a/_field-types/supported-field-types/knn-vector.md +++ b/_field-types/supported-field-types/knn-vector.md @@ -8,6 +8,8 @@ has_math: true --- # k-NN vector field type +**Introduced 1.0** +{: .label .label-purple } The [k-NN plugin]({{site.url}}{{site.baseurl}}/search-plugins/knn/index/) introduces a custom data type, the `knn_vector`, that allows users to ingest their k-NN vectors into an OpenSearch index and perform different kinds of k-NN search. The `knn_vector` field is highly configurable and can serve many different k-NN workloads. In general, a `knn_vector` field can be built either by providing a method definition or specifying a model id. @@ -20,8 +22,7 @@ PUT test-index { "settings": { "index": { - "knn": true, - "knn.algo_param.ef_search": 100 + "knn": true } }, "mappings": { @@ -29,14 +30,10 @@ PUT test-index "my_vector": { "type": "knn_vector", "dimension": 3, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", - "engine": "lucene", - "parameters": { - "ef_construction": 128, - "m": 24 - } + "engine": "faiss" } } } @@ -45,6 +42,92 @@ PUT test-index ``` {% include copy-curl.html %} +## Vector workload modes + +Vector search involves trade-offs between low-latency and low-cost search. Specify the `mode` mapping parameter of the `knn_vector` type to indicate which search mode you want to prioritize. The `mode` dictates the default values for k-NN parameters. You can further fine-tune your index by overriding the default parameter values in the k-NN field mapping. + +The following modes are currently supported. + +| Mode | Default engine | Description | +|:---|:---|:---| +| `in_memory` (Default) | `nmslib` | Prioritizes low-latency search. This mode uses the `nmslib` engine without any quantization applied. It is configured with the default parameter values for vector search in OpenSearch. | +| `on_disk` | `faiss` | Prioritizes low-cost vector search while maintaining strong recall. By default, the `on_disk` mode uses quantization and rescoring to execute a two-pass approach to retrieve the top neighbors. The `on_disk` mode supports only `float` vector types. | + +To create a k-NN index that uses the `on_disk` mode for low-cost search, send the following request: + +```json +PUT test-index +{ + "settings": { + "index": { + "knn": true + } + }, + "mappings": { + "properties": { + "my_vector": { + "type": "knn_vector", + "dimension": 3, + "space_type": "l2", + "mode": "on_disk" + } + } + } +} +``` +{% include copy-curl.html %} + +## Compression levels + +The `compression_level` mapping parameter selects a quantization encoder that reduces vector memory consumption by the given factor. The following table lists the available `compression_level` values. + +| Compression level | Supported engines | +|:------------------|:-------------------------------| +| `1x` | `faiss`, `lucene`, and `nmslib` | +| `2x` | `faiss` | +| `4x` | `lucene` | +| `8x` | `faiss` | +| `16x` | `faiss` | +| `32x` | `faiss` | + +For example, if a `compression_level` of `32x` is passed for a `float32` index of 768-dimensional vectors, the per-vector memory is reduced from `4 * 768 = 3072` bytes to `3072 / 32 = 846` bytes. Internally, binary quantization (which maps a `float` to a `bit`) may be used to achieve this compression. + +If you set the `compression_level` parameter, then you cannot specify an `encoder` in the `method` mapping. Compression levels greater than `1x` are only supported for `float` vector types. +{: .note} + +The following table lists the default `compression_level` values for the available workload modes. + +| Mode | Default compression level | +|:------------------|:-------------------------------| +| `in_memory` | `1x` | +| `on_disk` | `32x` | + + +To create a vector field with a `compression_level` of `16x`, specify the `compression_level` parameter in the mappings. This parameter overrides the default compression level for the `on_disk` mode from `32x` to `16x`, producing higher recall and accuracy at the expense of a larger memory footprint: + +```json +PUT test-index +{ + "settings": { + "index": { + "knn": true + } + }, + "mappings": { + "properties": { + "my_vector": { + "type": "knn_vector", + "dimension": 3, + "space_type": "l2", + "mode": "on_disk", + "compression_level": "16x" + } + } + } +} +``` +{% include copy-curl.html %} + ## Method definitions [Method definitions]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions) are used when the underlying [approximate k-NN]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/) algorithm does not require training. For example, the following `knn_vector` field specifies that *nmslib*'s implementation of *hnsw* should be used for approximate k-NN search. During indexing, *nmslib* will build the corresponding *hnsw* segment files. @@ -53,13 +136,13 @@ PUT test-index "my_vector": { "type": "knn_vector", "dimension": 4, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "nmslib", "parameters": { - "ef_construction": 128, - "m": 24 + "ef_construction": 100, + "m": 16 } } } @@ -71,6 +154,7 @@ Model IDs are used when the underlying Approximate k-NN algorithm requires a tra model contains the information needed to initialize the native library segment files. ```json +"my_vector": { "type": "knn_vector", "model_id": "my-model" } @@ -78,28 +162,36 @@ model contains the information needed to initialize the native library segment f However, if you intend to use Painless scripting or a k-NN score script, you only need to pass the dimension. ```json +"my_vector": { "type": "knn_vector", "dimension": 128 } ``` -## Lucene byte vector +## Byte vectors -By default, k-NN vectors are `float` vectors, where each dimension is 4 bytes. If you want to save storage space, you can use `byte` vectors with the `lucene` engine. In a `byte` vector, each dimension is a signed 8-bit integer in the [-128, 127] range. +By default, k-NN vectors are `float` vectors, in which each dimension is 4 bytes. If you want to save storage space, you can use `byte` vectors with the `faiss` or `lucene` engine. In a `byte` vector, each dimension is a signed 8-bit integer in the [-128, 127] range. -Byte vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines. +Byte vectors are supported only for the `lucene` and `faiss` engines. They are not supported for the `nmslib` engine. {: .note} -In [k-NN benchmarking tests](https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool), the use of `byte` rather than `float` vectors resulted in a significant reduction in storage and memory usage as well as improved indexing throughput and reduced query latency. Additionally, precision on recall was not greatly affected (note that recall can depend on various factors, such as the [quantization technique](#quantization-techniques) and data distribution). +In [k-NN benchmarking tests](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch), the use of `byte` rather than `float` vectors resulted in a significant reduction in storage and memory usage as well as improved indexing throughput and reduced query latency. Additionally, precision on recall was not greatly affected (note that recall can depend on various factors, such as the [quantization technique](#quantization-techniques) and data distribution). When using `byte` vectors, expect some loss of precision in the recall compared to using `float` vectors. Byte vectors are useful in large-scale applications and use cases that prioritize a reduced memory footprint in exchange for a minimal loss of recall. {: .important} - + +When using `byte` vectors with the `faiss` engine, we recommend using [SIMD optimization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#simd-optimization-for-the-faiss-engine), which helps to significantly reduce search latencies and improve indexing throughput. +{: .important} + Introduced in k-NN plugin version 2.9, the optional `data_type` parameter defines the data type of a vector. The default value of this parameter is `float`. To use a `byte` vector, set the `data_type` parameter to `byte` when creating mappings for an index: - ```json +### Example: HNSW + +The following example creates a byte vector index with the `lucene` engine and `hnsw` algorithm: + +```json PUT test-index { "settings": { @@ -114,13 +206,13 @@ PUT test-index "type": "knn_vector", "dimension": 3, "data_type": "byte", + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "lucene", "parameters": { - "ef_construction": 128, - "m": 24 + "ef_construction": 100, + "m": 16 } } } @@ -130,7 +222,7 @@ PUT test-index ``` {% include copy-curl.html %} -Then ingest documents as usual. Make sure each dimension in the vector is in the supported [-128, 127] range: +After creating the index, ingest documents as usual. Make sure each dimension in the vector is in the supported [-128, 127] range: ```json PUT test-index/_doc/1 @@ -166,9 +258,160 @@ GET test-index/_search ``` {% include copy-curl.html %} +### Example: IVF + +The `ivf` method requires a training step that creates and trains the model used to initialize the native library index during segment creation. For more information, see [Building a k-NN index from a model]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model). + +First, create an index that will contain byte vector training data. Specify the `faiss` engine and `ivf` algorithm and make sure that the `dimension` matches the dimension of the model you want to create: + +```json +PUT train-index +{ + "mappings": { + "properties": { + "train-field": { + "type": "knn_vector", + "dimension": 4, + "data_type": "byte" + } + } + } +} +``` +{% include copy-curl.html %} + +First, ingest training data containing byte vectors into the training index: + +```json +PUT _bulk +{ "index": { "_index": "train-index", "_id": "1" } } +{ "train-field": [127, 100, 0, -120] } +{ "index": { "_index": "train-index", "_id": "2" } } +{ "train-field": [2, -128, -10, 50] } +{ "index": { "_index": "train-index", "_id": "3" } } +{ "train-field": [13, -100, 5, 126] } +{ "index": { "_index": "train-index", "_id": "4" } } +{ "train-field": [5, 100, -6, -125] } +``` +{% include copy-curl.html %} + +Then, create and train the model named `byte-vector-model`. The model will be trained using the training data from the `train-field` in the `train-index`. Specify the `byte` data type: + +```json +POST _plugins/_knn/models/byte-vector-model/_train +{ + "training_index": "train-index", + "training_field": "train-field", + "dimension": 4, + "description": "model with byte data", + "data_type": "byte", + "method": { + "name": "ivf", + "engine": "faiss", + "space_type": "l2", + "parameters": { + "nlist": 1, + "nprobes": 1 + } + } +} +``` +{% include copy-curl.html %} + +To check the model training status, call the Get Model API: + +```json +GET _plugins/_knn/models/byte-vector-model?filter_path=state +``` +{% include copy-curl.html %} + +Once the training is complete, the `state` changes to `created`. + +Next, create an index that will initialize its native library indexes using the trained model: + +```json +PUT test-byte-ivf +{ + "settings": { + "index": { + "knn": true + } + }, + "mappings": { + "properties": { + "my_vector": { + "type": "knn_vector", + "model_id": "byte-vector-model" + } + } + } +} +``` +{% include copy-curl.html %} + +Ingest the data containing the byte vectors that you want to search into the created index: + +```json +PUT _bulk?refresh=true +{"index": {"_index": "test-byte-ivf", "_id": "1"}} +{"my_vector": [7, 10, 15, -120]} +{"index": {"_index": "test-byte-ivf", "_id": "2"}} +{"my_vector": [10, -100, 120, -108]} +{"index": {"_index": "test-byte-ivf", "_id": "3"}} +{"my_vector": [1, -2, 5, -50]} +{"index": {"_index": "test-byte-ivf", "_id": "4"}} +{"my_vector": [9, -7, 45, -78]} +{"index": {"_index": "test-byte-ivf", "_id": "5"}} +{"my_vector": [80, -70, 127, -128]} +``` +{% include copy-curl.html %} + +Finally, search the data. Be sure to provide a byte vector in the k-NN vector field: + +```json +GET test-byte-ivf/_search +{ + "size": 2, + "query": { + "knn": { + "my_vector": { + "vector": [100, -120, 50, -45], + "k": 2 + } + } + } +} +``` +{% include copy-curl.html %} + +### Memory estimation + +In the best-case scenario, byte vectors require 25% of the memory required by 32-bit vectors. + +#### HNSW memory estimation + +The memory required for Hierarchical Navigable Small Worlds (HNSW) is estimated to be `1.1 * (dimension + 8 * m)` bytes/vector, where `m` is the maximum number of bidirectional links created for each element during the construction of the graph. + +As an example, assume that you have 1 million vectors with a dimension of 256 and an `m` of 16. The memory requirement can be estimated as follows: + +```r +1.1 * (256 + 8 * 16) * 1,000,000 ~= 0.39 GB +``` + +#### IVF memory estimation + +The memory required for IVF is estimated to be `1.1 * ((dimension * num_vectors) + (4 * nlist * dimension))` bytes/vector, where `nlist` is the number of buckets to partition vectors into. + +As an example, assume that you have 1 million vectors with a dimension of 256 and an `nlist` of 128. The memory requirement can be estimated as follows: + +```r +1.1 * ((256 * 1,000,000) + (4 * 128 * 256)) ~= 0.27 GB +``` + + ### Quantization techniques -If your vectors are of the type `float`, you need to first convert them to the `byte` type before ingesting the documents. This conversion is accomplished by _quantizing the dataset_---reducing the precision of its vectors. There are many quantization techniques, such as scalar quantization or product quantization (PQ), which is used in the Faiss engine. The choice of quantization technique depends on the type of data you're using and can affect the accuracy of recall values. The following sections describe the scalar quantization algorithms that were used to quantize the [k-NN benchmarking test](https://github.com/opensearch-project/k-NN/tree/main/benchmarks/perf-tool) data for the [L2](#scalar-quantization-for-the-l2-space-type) and [cosine similarity](#scalar-quantization-for-the-cosine-similarity-space-type) space types. The provided pseudocode is for illustration purposes only. +If your vectors are of the type `float`, you need to first convert them to the `byte` type before ingesting the documents. This conversion is accomplished by _quantizing the dataset_---reducing the precision of its vectors. There are many quantization techniques, such as scalar quantization or product quantization (PQ), which is used in the Faiss engine. The choice of quantization technique depends on the type of data you're using and can affect the accuracy of recall values. The following sections describe the scalar quantization algorithms that were used to quantize the [k-NN benchmarking test](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch) data for the [L2](#scalar-quantization-for-the-l2-space-type) and [cosine similarity](#scalar-quantization-for-the-cosine-similarity-space-type) space types. The provided pseudocode is for illustration purposes only. #### Scalar quantization for the L2 space type @@ -267,7 +510,7 @@ return Byte(bval) ``` {% include copy.html %} -## Binary k-NN vectors +## Binary vectors You can reduce memory costs by a factor of 32 by switching from float to binary vectors. Using binary vector indexes can lower operational costs while maintaining high recall performance, making large-scale deployment more economical and efficient. @@ -305,14 +548,10 @@ PUT /test-binary-hnsw "type": "knn_vector", "dimension": 8, "data_type": "binary", + "space_type": "hamming", "method": { "name": "hnsw", - "space_type": "hamming", - "engine": "faiss", - "parameters": { - "ef_construction": 128, - "m": 24 - } + "engine": "faiss" } } } @@ -535,12 +774,12 @@ POST _plugins/_knn/models/test-binary-model/_train "dimension": 8, "description": "model with binary data", "data_type": "binary", + "space_type": "hamming", "method": { "name": "ivf", "engine": "faiss", - "space_type": "hamming", "parameters": { - "nlist": 1, + "nlist": 16, "nprobes": 1 } } diff --git a/_field-types/supported-field-types/match-only-text.md b/_field-types/supported-field-types/match-only-text.md index fd2c6b58500..534275bd3ac 100644 --- a/_field-types/supported-field-types/match-only-text.md +++ b/_field-types/supported-field-types/match-only-text.md @@ -8,6 +8,8 @@ grand_parent: Supported field types --- # Match-only text field type +**Introduced 2.12** +{: .label .label-purple } A `match_only_text` field is a variant of a `text` field designed for full-text search when scoring and positional information of terms within a document are not critical. diff --git a/_field-types/supported-field-types/nested.md b/_field-types/supported-field-types/nested.md index 90d09177d14..4db270c1dcc 100644 --- a/_field-types/supported-field-types/nested.md +++ b/_field-types/supported-field-types/nested.md @@ -11,6 +11,8 @@ redirect_from: --- # Nested field type +**Introduced 1.0** +{: .label .label-purple } A nested field type is a special type of [object field type]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/object/). @@ -312,3 +314,8 @@ Parameter | Description `include_in_parent` | A Boolean value that specifies whether all fields in the child nested object should also be added to the parent document in flattened form. Default is `false`. `include_in_root` | A Boolean value that specifies whether all fields in the child nested object should also be added to the root document in flattened form. Default is `false`. `properties` | Fields of this object, which can be of any supported type. New properties can be dynamically added to this object if `dynamic` is set to `true`. + +## Next steps + +- Learn about [joining queries]({{site.url}}{{site.baseurl}}/query-dsl/joining/) on nested fields. +- Learn about [retrieving inner hits]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/inner-hits/). \ No newline at end of file diff --git a/_field-types/supported-field-types/object.md b/_field-types/supported-field-types/object.md index db539a96081..ee50f5af9d2 100644 --- a/_field-types/supported-field-types/object.md +++ b/_field-types/supported-field-types/object.md @@ -11,6 +11,8 @@ redirect_from: --- # Object field type +**Introduced 1.0** +{: .label .label-purple } An object field type contains a JSON object (a set of name/value pairs). A value in a JSON object may be another JSON object. It is not necessary to specify `object` as the type when mapping object fields because `object` is the default type. diff --git a/_field-types/supported-field-types/percolator.md b/_field-types/supported-field-types/percolator.md index 92325b61275..2b067cf5951 100644 --- a/_field-types/supported-field-types/percolator.md +++ b/_field-types/supported-field-types/percolator.md @@ -10,6 +10,8 @@ redirect_from: --- # Percolator field type +**Introduced 1.0** +{: .label .label-purple } A percolator field type specifies to treat this field as a query. Any JSON object field can be marked as a percolator field. Normally, documents are indexed and searches are run against them. When you use a percolator field, you store a search, and later the percolate query matches documents to that search. diff --git a/_field-types/supported-field-types/range.md b/_field-types/supported-field-types/range.md index 22ae1d619eb..1001bae5845 100644 --- a/_field-types/supported-field-types/range.md +++ b/_field-types/supported-field-types/range.md @@ -10,6 +10,8 @@ redirect_from: --- # Range field types +**Introduced 1.0** +{: .label .label-purple } The following table lists all range field types that OpenSearch supports. diff --git a/_field-types/supported-field-types/rank.md b/_field-types/supported-field-types/rank.md index a4ec0fac4ce..f57c540cf52 100644 --- a/_field-types/supported-field-types/rank.md +++ b/_field-types/supported-field-types/rank.md @@ -10,6 +10,8 @@ redirect_from: --- # Rank field types +**Introduced 1.0** +{: .label .label-purple } The following table lists all rank field types that OpenSearch supports. diff --git a/_field-types/supported-field-types/search-as-you-type.md b/_field-types/supported-field-types/search-as-you-type.md index b9141e6b8ed..55774d432a9 100644 --- a/_field-types/supported-field-types/search-as-you-type.md +++ b/_field-types/supported-field-types/search-as-you-type.md @@ -11,6 +11,8 @@ redirect_from: --- # Search-as-you-type field type +**Introduced 1.0** +{: .label .label-purple } A search-as-you-type field type provides search-as-you-type functionality using both prefix and infix completion. diff --git a/_field-types/supported-field-types/text.md b/_field-types/supported-field-types/text.md index 16350c0cb35..b06bec2187c 100644 --- a/_field-types/supported-field-types/text.md +++ b/_field-types/supported-field-types/text.md @@ -11,6 +11,8 @@ redirect_from: --- # Text field type +**Introduced 1.0** +{: .label .label-purple } A `text` field type contains a string that is analyzed. It is used for full-text search because it allows partial matches. Searches for multiple terms can match some but not all of them. Depending on the analyzer, results can be case insensitive, stemmed, have stopwords removed, have synonyms applied, and so on. diff --git a/_field-types/supported-field-types/token-count.md b/_field-types/supported-field-types/token-count.md index 6c3445e6a75..11eeff78544 100644 --- a/_field-types/supported-field-types/token-count.md +++ b/_field-types/supported-field-types/token-count.md @@ -11,6 +11,8 @@ redirect_from: --- # Token count field type +**Introduced 1.0** +{: .label .label-purple } A token count field type stores the number of analyzed tokens in a string. diff --git a/_field-types/supported-field-types/unsigned-long.md b/_field-types/supported-field-types/unsigned-long.md index dde8d25dee2..4c38cb3090c 100644 --- a/_field-types/supported-field-types/unsigned-long.md +++ b/_field-types/supported-field-types/unsigned-long.md @@ -8,6 +8,8 @@ has_children: false --- # Unsigned long field type +**Introduced 2.8** +{: .label .label-purple } The `unsigned_long` field type is a numeric field type that represents an unsigned 64-bit integer with a minimum value of 0 and a maximum value of 264 − 1. In the following example, `counter` is mapped as an `unsigned_long` field: diff --git a/_field-types/supported-field-types/wildcard.md b/_field-types/supported-field-types/wildcard.md index c438f35c623..0f8c1761354 100644 --- a/_field-types/supported-field-types/wildcard.md +++ b/_field-types/supported-field-types/wildcard.md @@ -8,6 +8,8 @@ grand_parent: Supported field types --- # Wildcard field type +**Introduced 2.15** +{: .label .label-purple } A `wildcard` field is a variant of a `keyword` field designed for arbitrary substring and regular expression matching. diff --git a/_field-types/supported-field-types/xy-point.md b/_field-types/supported-field-types/xy-point.md index 57b6f647584..0d066b9f099 100644 --- a/_field-types/supported-field-types/xy-point.md +++ b/_field-types/supported-field-types/xy-point.md @@ -11,6 +11,8 @@ redirect_from: --- # xy point field type +**Introduced 2.4** +{: .label .label-purple } An xy point field type contains a point in a two-dimensional Cartesian coordinate system, specified by x and y coordinates. It is based on the Lucene [XYPoint](https://lucene.apache.org/core/9_3_0/core/org/apache/lucene/geo/XYPoint.html) field type. The xy point field type is similar to the [geopoint]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/geo-point/) field type, but does not have the range limitations of geopoint. The coordinates of an xy point are single-precision floating-point values. For information about the range and precision of floating-point values, see [Numeric field types]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/numeric/). diff --git a/_field-types/supported-field-types/xy-shape.md b/_field-types/supported-field-types/xy-shape.md index f1c7191240c..9dcbafceaed 100644 --- a/_field-types/supported-field-types/xy-shape.md +++ b/_field-types/supported-field-types/xy-shape.md @@ -11,6 +11,8 @@ redirect_from: --- # xy shape field type +**Introduced 2.4** +{: .label .label-purple } An xy shape field type contains a shape, such as a polygon or a collection of xy points. It is based on the Lucene [XYShape](https://lucene.apache.org/core/9_3_0/core/org/apache/lucene/document/XYShape.html) field type. To index an xy shape, OpenSearch tessellates the shape into a triangular mesh and stores each triangle in a BKD tree (a set of balanced k-dimensional trees). This provides a 10-7decimal degree of precision, which represents near-perfect spatial resolution. diff --git a/_getting-started/communicate.md b/_getting-started/communicate.md index 9960f63b2cd..3472270c305 100644 --- a/_getting-started/communicate.md +++ b/_getting-started/communicate.md @@ -200,7 +200,7 @@ PUT /students/_doc/1 "address": "123 Main St." } ``` -{% include copy.html %} +{% include copy-curl.html %} Alternatively, you can update parts of a document by calling the Update Document API: diff --git a/_getting-started/intro.md b/_getting-started/intro.md index edd178a23f9..f5eb24ba2be 100644 --- a/_getting-started/intro.md +++ b/_getting-started/intro.md @@ -56,6 +56,7 @@ ID | Name | GPA | Graduation year 1 | John Doe | 3.89 | 2022 2 | Jonathan Powers | 3.85 | 2025 3 | Jane Doe | 3.52 | 2024 +... | | | ## Clusters and nodes diff --git a/_im-plugin/index-context.md b/_im-plugin/index-context.md new file mode 100644 index 00000000000..be0dbd527d0 --- /dev/null +++ b/_im-plugin/index-context.md @@ -0,0 +1,175 @@ +--- +layout: default +title: Index context +nav_order: 14 +redirect_from: + - /opensearch/index-context/ +--- + +# Index context + +This is an experimental feature and is not recommended for use in a production environment. For updates on the progress the feature or if you want to leave feedback, join the discussion on the [OpenSearch forum](https://forum.opensearch.org/). +{: .warning} + +Index context declares the use case for an index. Using the context information, OpenSearch applies a predetermined set of settings and mappings, which provides the following benefits: + +- Optimized performance +- Settings tuned to your specific use case +- Accurate mappings and aliases based on [OpenSearch Integrations]({{site.url}}{{site.baseurl}}/integrations/) + +The settings and metadata configuration that are applied using component templates are automatically loaded when your cluster starts. Component templates that start with `@abc_template@` or Application-Based Configuration (ABC) templates can only be used through a `context` object declaration, in order to prevent configuration issues. +{: .warning} + + +## Installation + +To install the index context feature: + +1. Install the `opensearch-system-templates` plugin on all nodes in your cluster using one of the [installation methods]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#install). + +2. Set the feature flag `opensearch.experimental.feature.application_templates.enabled` to `true`. For more information about enabling and disabling feature flags, see [Enabling experimental features]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). + +3. Set the `cluster.application_templates.enabled` setting to `true`. For instructions on how to configure OpenSearch, see [configuring settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/#static-settings). + +## Using the `context` setting + +Use the `context` setting with the Index API to add use-case-specific context. + +### Considerations + +Consider the following when using the `context` parameter during index creation: + +1. If you use the `context` parameter to create an index, you cannot include any settings declared in the index context during index creation or dynamic settings updates. +2. The index context becomes permanent when set on an index or index template. + +When you adhere to these limitations, suggested configurations or mappings are uniformly applied on indexed data within the specified context. + +### Examples + +The following examples show how to use index context. + + +#### Create an index + +The following example request creates an index in which to store metric data by declaring a `metrics` mapping as the context: + +```json +PUT /my-metrics-index +{ + "context": { + "name": "metrics" + } +} +``` +{% include copy-curl.html %} + +After creation, the context is added to the index and the corresponding settings are applied: + + +**GET request** + +```json +GET /my-metrics-index +``` +{% include copy-curl.html %} + + +**Response** + +```json +{ + "my-metrics-index": { + "aliases": {}, + "mappings": {}, + "settings": { + "index": { + "codec": "zstd_no_dict", + "refresh_interval": "60s", + "number_of_shards": "1", + "provided_name": "my-metrics-index", + "merge": { + "policy": "log_byte_size" + }, + "context": { + "created_version": "1", + "current_version": "1" + }, + ... + } + }, + "context": { + "name": "metrics", + "version": "_latest" + } + } +} +``` + + +#### Create an index template + +You can also use the `context` parameter when creating an index template. The following example request creates an index template with the context information as `logs`: + +```json +PUT _index_template/my-logs +{ + "context": { + "name": "logs", + "version": "1" + }, + "index_patterns": [ + "my-logs-*" + ] +} +``` +{% include copy-curl.html %} + +All indexes created using this index template will get the metadata provided by the associated component template. The following request and response show how `context` is added to the template: + +**Get index template** + +```json +GET _index_template/my-logs +``` +{% include copy-curl.html %} + +**Response** + +```json +{ + "index_templates": [ + { + "name": "my-logs2", + "index_template": { + "index_patterns": [ + "my-logs1-*" + ], + "context": { + "name": "logs", + "version": "1" + } + } + } + ] +} +``` + +If there is any conflict between any settings, mappings, or aliases directly declared by your template and the backing component template for the context, the latter gets higher priority during index creation. + + +## Available context templates + +The following templates are available to be used through the `context` parameter as of OpenSearch 2.17: + +- `logs` +- `metrics` +- `nginx-logs` +- `amazon-cloudtrail-logs` +- `amazon-elb-logs` +- `amazon-s3-logs` +- `apache-web-logs` +- `k8s-logs` + +For more information about these templates, see the [OpenSearch system templates repository](https://github.com/opensearch-project/opensearch-system-templates/tree/main/src/main/resources/org/opensearch/system/applicationtemplates/v1). + +To view the current version of these templates on your cluster, use `GET /_component_template`. diff --git a/_ingest-pipelines/processors/index-processors.md b/_ingest-pipelines/processors/index-processors.md index 0e1ee1e114d..9628a167280 100644 --- a/_ingest-pipelines/processors/index-processors.md +++ b/_ingest-pipelines/processors/index-processors.md @@ -69,6 +69,12 @@ Processor type | Description `urldecode` | Decodes a string from URL-encoded format. `user_agent` | Extracts details from the user agent sent by a browser to its web requests. +## Processor limit settings + +You can limit the number of ingest processors using the cluster setting `cluster.ingest.max_number_processors`. The total number of processors includes both the number of processors and the number of [`on_failure`]({{site.url}}{{site.baseurl}}/ingest-pipelines/pipeline-failures/) processors. + +The default value for `cluster.ingest.max_number_processors` is `Integer.MAX_VALUE`. Adding a higher number of processors than the value configured in `cluster.ingest.max_number_processors` will throw an `IllegalStateException`. + ## Batch-enabled processors Some processors support batch ingestion---they can process multiple documents at the same time as a batch. These batch-enabled processors usually provide better performance when using batch processing. For batch processing, use the [Bulk API]({{site.url}}{{site.baseurl}}/api-reference/document-apis/bulk/) and provide a `batch_size` parameter. All batch-enabled processors have a batch mode and a single-document mode. When you ingest documents using the `PUT` method, the processor functions in single-document mode and processes documents in series. Currently, only the `text_embedding` and `sparse_encoding` processors are batch enabled. All other processors process documents one at a time. diff --git a/_ingest-pipelines/processors/text-chunking.md b/_ingest-pipelines/processors/text-chunking.md index 97229d2aaaf..4dccca49262 100644 --- a/_ingest-pipelines/processors/text-chunking.md +++ b/_ingest-pipelines/processors/text-chunking.md @@ -42,6 +42,9 @@ The following table lists the required and optional parameters for the `text_chu | `description` | String | Optional | A brief description of the processor. | | `tag` | String | Optional | An identifier tag for the processor. Useful when debugging in order to distinguish between processors of the same type. | +To perform chunking on nested fields, specify `input_field` and `output_field` values as JSON objects. Dot paths of nested fields are not supported. For example, use `"field_map": { "foo": { "bar": "bar_chunk"} }` instead of `"field_map": { "foo.bar": "foo.bar_chunk"}`. +{: .note} + ### Fixed token length algorithm The following table lists the optional parameters for the `fixed_token_length` algorithm. diff --git a/_install-and-configure/configuring-opensearch/index-settings.md b/_install-and-configure/configuring-opensearch/index-settings.md index a1894a0d2c4..bd9b9651aaf 100644 --- a/_install-and-configure/configuring-opensearch/index-settings.md +++ b/_install-and-configure/configuring-opensearch/index-settings.md @@ -73,6 +73,12 @@ OpenSearch supports the following dynamic cluster-level index settings: - `cluster.remote_store.segment.transfer_timeout` (Time unit): Controls the maximum amount of time to wait for all new segments to update after refresh to the remote store. If the upload does not complete within a specified amount of time, it throws a `SegmentUploadFailedException` error. Default is `30m`. It has a minimum constraint of `10m`. +- `cluster.remote_store.translog.path.prefix` (String): Controls the fixed path prefix for translog data on a remote-store-enabled cluster. This setting only applies when the `cluster.remote_store.index.path.type` setting is either `HASHED_PREFIX` or `HASHED_INFIX`. Default is an empty string, `""`. + +- `cluster.remote_store.segments.path.prefix` (String): Controls the fixed path prefix for segment data on a remote-store-enabled cluster. This setting only applies when the `cluster.remote_store.index.path.type` setting is either `HASHED_PREFIX` or `HASHED_INFIX`. Default is an empty string, `""`. + +- `cluster.snapshot.shard.path.prefix` (String): Controls the fixed path prefix for snapshot shard-level blobs. This setting only applies when the repository `shard_path_type` setting is either `HASHED_PREFIX` or `HASHED_INFIX`. Default is an empty string, `""`. + ## Index-level index settings You can specify index settings at index creation. There are two types of index settings: diff --git a/_install-and-configure/configuring-opensearch/network-settings.md b/_install-and-configure/configuring-opensearch/network-settings.md index f96dde97e16..dc61ccc49b9 100644 --- a/_install-and-configure/configuring-opensearch/network-settings.md +++ b/_install-and-configure/configuring-opensearch/network-settings.md @@ -51,7 +51,7 @@ OpenSearch supports the following advanced network settings for transport commun ## Selecting the transport -The default OpenSearch transport is provided by the `transport-netty4` module and uses the [Netty 4](https://netty.io/) engine for both internal TCP-based communication between nodes in the cluster and external HTTP-based communication with clients. This communication is fully asynchronous and non-blocking. However, there are other transport plugins available that can be used interchangeably: +The default OpenSearch transport is provided by the `transport-netty4` module and uses the [Netty 4](https://netty.io/) engine for both internal TCP-based communication between nodes in the cluster and external HTTP-based communication with clients. This communication is fully asynchronous and non-blocking. The following table lists other available transport plugins that can be used interchangeably. Plugin | Description :---------- | :-------- diff --git a/_install-and-configure/configuring-opensearch/search-settings.md b/_install-and-configure/configuring-opensearch/search-settings.md index c3c4337d01b..e53f05aa649 100644 --- a/_install-and-configure/configuring-opensearch/search-settings.md +++ b/_install-and-configure/configuring-opensearch/search-settings.md @@ -39,6 +39,8 @@ OpenSearch supports the following search settings: - `search.dynamic_pruning.cardinality_aggregation.max_allowed_cardinality` (Dynamic, integer): Determines the threshold for applying dynamic pruning in cardinality aggregation. If a field’s cardinality exceeds this threshold, the aggregation reverts to the default method. This is an experimental feature and may change or be removed in future versions. +- `search.keyword_index_or_doc_values_enabled` (Dynamic, Boolean): Determines whether to use the index or doc values when running `multi_term` queries on `keyword` fields. Default value is `false`. + ## Point in Time settings For information about PIT settings, see [PIT settings]({{site.url}}{{site.baseurl}}/search-plugins/point-in-time-api/#pit-settings). diff --git a/_install-and-configure/configuring-opensearch/security-settings.md b/_install-and-configure/configuring-opensearch/security-settings.md index b9c375d2087..2ac09a48197 100644 --- a/_install-and-configure/configuring-opensearch/security-settings.md +++ b/_install-and-configure/configuring-opensearch/security-settings.md @@ -9,7 +9,7 @@ nav_order: 40 The Security plugin provides a number of YAML configuration files that are used to store the necessary settings that define the way the Security plugin manages users, roles, and activity within the cluster. For a full list of the Security plugin configuration files, see [Modifying the YAML files]({{site.url}}{{site.baseurl}}/security/configuration/yaml/). -The following sections describe security-related settings in `opensearch.yml`. To learn more about static and dynamic settings, see [Configuring OpenSearch]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/). +The following sections describe security-related settings in `opensearch.yml`. You can find the `opensearch.yml` in the `/config/opensearch.yml`. To learn more about static and dynamic settings, see [Configuring OpenSearch]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/). ## Common settings diff --git a/_install-and-configure/plugins.md b/_install-and-configure/plugins.md index 3a5d6a18345..e96b29e8228 100644 --- a/_install-and-configure/plugins.md +++ b/_install-and-configure/plugins.md @@ -181,7 +181,7 @@ Continue with installation? [y/N]y ### Install a plugin using Maven coordinates -The `opensearch-plugin install` tool also allows you to specify Maven coordinates for available artifacts and versions hosted on [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). The tool parses the Maven coordinates you provide and constructs a URL. As a result, the host must be able to connect directly to the Maven Central site. The plugin installation fails if you pass coordinates to a proxy or local repository. +The `opensearch-plugin install` tool also allows you to specify Maven coordinates for available artifacts and versions hosted on [Maven Central](https://central.sonatype.com/namespace/org.opensearch.plugin). The tool parses the Maven coordinates you provide and constructs a URL. As a result, the host must be able to connect directly to the Maven Central site. The plugin installation fails if you pass coordinates to a proxy or local repository. #### Usage ```bash diff --git a/_layouts/search_layout.html b/_layouts/search_layout.html index 47b8f25d1c6..67e877fcb80 100644 --- a/_layouts/search_layout.html +++ b/_layouts/search_layout.html @@ -38,12 +38,16 @@

Filter results

- +
- - + + +
+
+ +
@@ -97,10 +101,7 @@

element.value).join(','); const urlPath = window.location.pathname; const versionMatch = urlPath.match(/(\d+\.\d+)/); const docsVersion = versionMatch ? versionMatch[1] : "latest"; @@ -139,11 +140,12 @@

{ + categoryBlog.addEventListener('change', () => { + updateAllCheckbox(); + triggerSearch(searchInput.value.trim()); + }); + categoryEvent.addEventListener('change', () => { updateAllCheckbox(); triggerSearch(searchInput.value.trim()); }); diff --git a/_ml-commons-plugin/api/async-batch-ingest.md b/_ml-commons-plugin/api/async-batch-ingest.md new file mode 100644 index 00000000000..ace95ba4d4d --- /dev/null +++ b/_ml-commons-plugin/api/async-batch-ingest.md @@ -0,0 +1,97 @@ +--- +layout: default +title: Asynchronous batch ingestion +parent: ML Commons APIs +has_children: false +has_toc: false +nav_order: 35 +--- + +# Asynchronous batch ingestion +**Introduced 2.17** +{: .label .label-purple } + +Use the Asynchronous Batch Ingestion API to ingest data into your OpenSearch cluster from your files on remote file servers, such as Amazon Simple Storage Service (Amazon S3) or OpenAI. For detailed configuration steps, see [Asynchronous batch ingestion]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/async-batch-ingestion/). + +## Path and HTTP methods + +```json +POST /_plugins/_ml/_batch_ingestion +``` + +#### Request fields + +The following table lists the available request fields. + +Field | Data type | Required/Optional | Description +:--- | :--- | :--- +`index_name`| String | Required | The index name. +`field_map` | Object | Required | Maps fields from the source file to specific fields in an OpenSearch index for ingestion. +`ingest_fields` | Array | Optional | Lists fields from the source file that should be ingested directly into the OpenSearch index without any additional mapping. +`credential` | Object | Required | Contains the authentication information for accessing external data sources, such as Amazon S3 or OpenAI. +`data_source` | Object | Required | Specifies the type and location of the external file(s) from which the data is ingested. +`data_source.type` | String | Required | Specifies the type of the external data source. Valid values are `s3` and `openAI`. +`data_source.source` | Array | Required | Specifies one or more file locations from which the data is ingested. For `s3`, specify the file path to the Amazon S3 bucket (for example, `["s3://offlinebatch/output/sagemaker_batch.json.out"]`). For `openAI`, specify the file IDs for input or output files (for example, `["file-", "file-", "file-"]`). + +## Example request: Ingesting a single file + +```json +POST /_plugins/_ml/_batch_ingestion +{ + "index_name": "my-nlp-index", + "field_map": { + "chapter": "$.content[0]", + "title": "$.content[1]", + "chapter_embedding": "$.SageMakerOutput[0]", + "title_embedding": "$.SageMakerOutput[1]", + "_id": "$.id" + }, + "ingest_fields": ["$.id"], + "credential": { + "region": "us-east-1", + "access_key": "", + "secret_key": "", + "session_token": "" + }, + "data_source": { + "type": "s3", + "source": ["s3://offlinebatch/output/sagemaker_batch.json.out"] + } +} +``` +{% include copy-curl.html %} + +## Example request: Ingesting multiple files + +```json +POST /_plugins/_ml/_batch_ingestion +{ + "index_name": "my-nlp-index-openai", + "field_map": { + "question": "source[1].$.body.input[0]", + "answer": "source[1].$.body.input[1]", + "question_embedding":"source[0].$.response.body.data[0].embedding", + "answer_embedding":"source[0].$.response.body.data[1].embedding", + "_id": ["source[0].$.custom_id", "source[1].$.custom_id"] + }, + "ingest_fields": ["source[2].$.custom_field1", "source[2].$.custom_field2"], + "credential": { + "openAI_key": "" + }, + "data_source": { + "type": "openAI", + "source": ["file-", "file-", "file-"] + } +} +``` +{% include copy-curl.html %} + +## Example response + +```json +{ + "task_id": "cbsPlpEBMHcagzGbOQOx", + "task_type": "BATCH_INGEST", + "status": "CREATED" +} +``` diff --git a/_ml-commons-plugin/api/connector-apis/update-connector.md b/_ml-commons-plugin/api/connector-apis/update-connector.md index 64790bb57f5..625d58bb620 100644 --- a/_ml-commons-plugin/api/connector-apis/update-connector.md +++ b/_ml-commons-plugin/api/connector-apis/update-connector.md @@ -29,17 +29,20 @@ PUT /_plugins/_ml/connectors/ The following table lists the updatable fields. For more information about all connector fields, see [Blueprint configuration parameters]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints#configuration-parameters). -| Field | Data type | Description | -| :--- | :--- | :--- | -| `name` | String | The name of the connector. | -| `description` | String | A description of the connector. | -| `version` | Integer | The version of the connector. | -| `protocol` | String | The protocol for the connection. For AWS services, such as Amazon SageMaker and Amazon Bedrock, use `aws_sigv4`. For all other services, use `http`. | -| `parameters` | JSON object | The default connector parameters, including `endpoint` and `model`. Any parameters included in this field can be overridden by parameters specified in a predict request. | +| Field | Data type | Description | +| :--- |:------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `name` | String | The name of the connector. | +| `description` | String | A description of the connector. | +| `version` | Integer | The connector version. | +| `protocol` | String | The protocol for the connection. For AWS services, such as Amazon SageMaker and Amazon Bedrock, use `aws_sigv4`. For all other services, use `http`. | +| `parameters` | JSON object | The default connector parameters, including `endpoint` and `model`. Any parameters included in this field can be overridden by parameters specified in a predict request. | | `credential` | JSON object | Defines any credential variables required in order to connect to your chosen endpoint. ML Commons uses **AES/GCM/NoPadding** symmetric encryption to encrypt your credentials. When the connection to the cluster first starts, OpenSearch creates a random 32-byte encryption key that persists in OpenSearch's system index. Therefore, you do not need to manually set the encryption key. | -| `actions` | JSON array | Defines which actions can run within the connector. If you're an administrator creating a connection, add the [blueprint]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/) for your desired connection. | -| `backend_roles` | JSON array | A list of OpenSearch backend roles. For more information about setting up backend roles, see [Assigning backend roles to users]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#assigning-backend-roles-to-users). | -| `access_mode` | String | Sets the access mode for the model, either `public`, `restricted`, or `private`. Default is `private`. For more information about `access_mode`, see [Model groups]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#model-groups). | +| `actions` | JSON array | Defines which actions can run within the connector. If you're an administrator creating a connection, add the [blueprint]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/) for your desired connection. | +| `backend_roles` | JSON array | A list of OpenSearch backend roles. For more information about setting up backend roles, see [Assigning backend roles to users]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#assigning-backend-roles-to-users). | +| `access_mode` | String | Sets the access mode for the model, either `public`, `restricted`, or `private`. Default is `private`. For more information about `access_mode`, see [Model groups]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#model-groups). | +| `parameters.skip_validating_missing_parameters` | Boolean | When set to `true`, this option allows you to send a request using a connector without validating any missing parameters. Default is `false`. | + + #### Example request diff --git a/_ml-commons-plugin/api/execute-algorithm.md b/_ml-commons-plugin/api/execute-algorithm.md index 7b06cfefe83..6acd9264448 100644 --- a/_ml-commons-plugin/api/execute-algorithm.md +++ b/_ml-commons-plugin/api/execute-algorithm.md @@ -2,7 +2,7 @@ layout: default title: Execute algorithm parent: ML Commons APIs -nav_order: 30 +nav_order: 37 --- # Execute algorithm diff --git a/_ml-commons-plugin/api/model-apis/batch-predict.md b/_ml-commons-plugin/api/model-apis/batch-predict.md index b32fbb108db..c1dc7348fec 100644 --- a/_ml-commons-plugin/api/model-apis/batch-predict.md +++ b/_ml-commons-plugin/api/model-apis/batch-predict.md @@ -31,7 +31,13 @@ POST /_plugins/_ml/models//_batch_predict ## Prerequisites -Before using the Batch Predict API, you need to create a connector to the externally hosted model. For example, to create a connector to an OpenAI `text-embedding-ada-002` model, send the following request: +Before using the Batch Predict API, you need to create a connector to the externally hosted model. For each action, specify the `action_type` parameter that describes the action: + +- `batch_predict`: Runs the batch predict operation. +- `batch_predict_status`: Checks the batch predict operation status. +- `cancel_batch_predict`: Cancels the batch predict operation. + +For example, to create a connector to an OpenAI `text-embedding-ada-002` model, send the following request. The `cancel_batch_predict` action is optional and supports canceling the batch job running on OpenAI: ```json POST /_plugins/_ml/connectors/_create @@ -68,6 +74,22 @@ POST /_plugins/_ml/connectors/_create "Authorization": "Bearer ${credential.openAI_key}" }, "request_body": "{ \"input_file_id\": \"${parameters.input_file_id}\", \"endpoint\": \"${parameters.endpoint}\", \"completion_window\": \"24h\" }" + }, + { + "action_type": "batch_predict_status", + "method": "GET", + "url": "https://api.openai.com/v1/batches/${parameters.id}", + "headers": { + "Authorization": "Bearer ${credential.openAI_key}" + } + }, + { + "action_type": "cancel_batch_predict", + "method": "POST", + "url": "https://api.openai.com/v1/batches/${parameters.id}/cancel", + "headers": { + "Authorization": "Bearer ${credential.openAI_key}" + } } ] } @@ -123,45 +145,87 @@ POST /_plugins/_ml/models/lyjxwZABNrAVdFa9zrcZ/_batch_predict #### Example response +The response contains the task ID for the batch predict operation: + ```json { - "inference_results": [ - { - "output": [ - { - "name": "response", - "dataAsMap": { - "id": "batch_", - "object": "batch", - "endpoint": "/v1/embeddings", - "errors": null, - "input_file_id": "file-", - "completion_window": "24h", - "status": "validating", - "output_file_id": null, - "error_file_id": null, - "created_at": 1722037257, - "in_progress_at": null, - "expires_at": 1722123657, - "finalizing_at": null, - "completed_at": null, - "failed_at": null, - "expired_at": null, - "cancelling_at": null, - "cancelled_at": null, - "request_counts": { - "total": 0, - "completed": 0, - "failed": 0 - }, - "metadata": null - } - } - ], - "status_code": 200 - } - ] + "task_id": "KYZSv5EBqL2d0mFvs80C", + "status": "CREATED" } ``` -For the definition of each field in the result, see [OpenAI Batch API](https://platform.openai.com/docs/guides/batch). Once the batch inference is complete, you can download the output by calling the [OpenAI Files API](https://platform.openai.com/docs/api-reference/files) and providing the file name specified in the `id` field of the response. \ No newline at end of file +To check the status of the batch predict job, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/). You can find the job details in the `remote_job` field in the task. Once the prediction is complete, the task `state` changes to `COMPLETED`. + +#### Example request + +```json +GET /_plugins/_ml/tasks/KYZSv5EBqL2d0mFvs80C +``` +{% include copy-curl.html %} + +#### Example response + +The response contains the batch predict operation details in the `remote_job` field: + +```json +{ + "model_id": "JYZRv5EBqL2d0mFvKs1E", + "task_type": "BATCH_PREDICTION", + "function_name": "REMOTE", + "state": "RUNNING", + "input_type": "REMOTE", + "worker_node": [ + "Ee5OCIq0RAy05hqQsNI1rg" + ], + "create_time": 1725491751455, + "last_update_time": 1725491751455, + "is_async": false, + "remote_job": { + "cancelled_at": null, + "metadata": null, + "request_counts": { + "total": 3, + "completed": 3, + "failed": 0 + }, + "input_file_id": "file-XXXXXXXXXXXX", + "output_file_id": "file-XXXXXXXXXXXXX", + "error_file_id": null, + "created_at": 1725491753, + "in_progress_at": 1725491753, + "expired_at": null, + "finalizing_at": 1725491757, + "completed_at": null, + "endpoint": "/v1/embeddings", + "expires_at": 1725578153, + "cancelling_at": null, + "completion_window": "24h", + "id": "batch_XXXXXXXXXXXXXXX", + "failed_at": null, + "errors": null, + "object": "batch", + "status": "in_progress" + } +} +``` + +For the definition of each field in the result, see [OpenAI Batch API](https://platform.openai.com/docs/guides/batch). Once the batch inference is complete, you can download the output by calling the [OpenAI Files API](https://platform.openai.com/docs/api-reference/files) and providing the file name specified in the `id` field of the response. + +### Canceling a batch predict job + +You can also cancel the batch predict operation running on the remote platform using the task ID returned by the batch predict request. To add this capability, set the `action_type` to `cancel_batch_predict` in the connector configuration when creating the connector. + +#### Example request + +```json +POST /_plugins/_ml/tasks/KYZSv5EBqL2d0mFvs80C/_cancel_batch +``` +{% include copy-curl.html %} + +#### Example response + +```json +{ + "status": "OK" +} +``` diff --git a/_ml-commons-plugin/index.md b/_ml-commons-plugin/index.md index f0355b6be3e..50d637379e1 100644 --- a/_ml-commons-plugin/index.md +++ b/_ml-commons-plugin/index.md @@ -32,6 +32,10 @@ ML Commons supports various algorithms to help train ML models and make predicti ML Commons provides its own set of REST APIs. For more information, see [ML Commons API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/index/). +## ML-powered search + +For information about available ML-powered search types, see [ML-powered search]({{site.url}}{{site.baseurl}}/search-plugins/index/#ml-powered-search). + ## Tutorials Using the OpenSearch ML framework, you can build various applications, from implementing conversational search to building your own chatbot. For more information, see [Tutorials]({{site.url}}{{site.baseurl}}/ml-commons-plugin/tutorials/index/). \ No newline at end of file diff --git a/_ml-commons-plugin/remote-models/async-batch-ingestion.md b/_ml-commons-plugin/remote-models/async-batch-ingestion.md new file mode 100644 index 00000000000..a09c0284778 --- /dev/null +++ b/_ml-commons-plugin/remote-models/async-batch-ingestion.md @@ -0,0 +1,190 @@ +--- +layout: default +title: Asynchronous batch ingestion +nav_order: 90 +parent: Connecting to externally hosted models +grand_parent: Integrating ML models +--- + + +# Asynchronous batch ingestion +**Introduced 2.17** +{: .label .label-purple } + +[Batch ingestion]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/batch-ingestion/) configures an ingest pipeline, which processes documents one by one. For each document, batch ingestion calls an externally hosted model to generate text embeddings from the document text and then ingests the document, including text and embeddings, into an OpenSearch index. + +An alternative to this real-time process, _asynchronous_ batch ingestion, ingests both documents and their embeddings generated outside of OpenSearch and stored on a remote file server, such as Amazon Simple Storage Service (Amazon S3) or OpenAI. Asynchronous ingestion returns a task ID and runs asynchronously to ingest data offline into your k-NN cluster for neural search. You can use asynchronous batch ingestion together with the [Batch Predict API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/model-apis/batch-predict/) to perform inference asynchronously. The batch predict operation takes an input file containing documents and calls an externally hosted model to generate embeddings for those documents in an output file. You can then use asynchronous batch ingestion to ingest both the input file containing documents and the output file containing their embeddings into an OpenSearch index. + +As of OpenSearch 2.17, the Asynchronous Batch Ingestion API is supported by Amazon SageMaker, Amazon Bedrock, and OpenAI. +{: .note} + +## Prerequisites + +Before using asynchronous batch ingestion, you must generate text embeddings using a model of your choice and store the output on a file server, such as Amazon S3. For example, you can store the output of a Batch API call to an Amazon SageMaker text embedding model in a file with the Amazon S3 output path `s3://offlinebatch/output/sagemaker_batch.json.out`. The output is in JSONL format, with each line representing a text embedding result. The file contents have the following format: + +``` +{"SageMakerOutput":[[-0.017166402,0.055771016,...],[-0.06422759,-0.004301484,...],"content":["this is chapter 1","harry potter"],"id":1} +{"SageMakerOutput":[[-0.017455402,0.023771016,...],[-0.02322759,-0.009101284,...],"content":["this is chapter 2","draco malfoy"],"id":1} +... +``` + +## Ingesting data from a single file + +First, create a k-NN index into which you'll ingest the data. The fields in the k-NN index represent the structure of the data in the source file. + +In this example, the source file holds documents containing titles and chapters, along with their corresponding embeddings. Thus, you'll create a k-NN index with the fields `id`, `chapter_embedding`, `chapter`, `title_embedding`, and `title`: + +```json +PUT /my-nlp-index +{ + "settings": { + "index.knn": true + }, + "mappings": { + "properties": { + "id": { + "type": "text" + }, + "chapter_embedding": { + "type": "knn_vector", + "dimension": 384, + "method": { + "engine": "nmslib", + "space_type": "cosinesimil", + "name": "hnsw", + "parameters": { + "ef_construction": 512, + "m": 16 + } + } + }, + "chapter": { + "type": "text" + }, + "title_embedding": { + "type": "knn_vector", + "dimension": 384, + "method": { + "engine": "nmslib", + "space_type": "cosinesimil", + "name": "hnsw", + "parameters": { + "ef_construction": 512, + "m": 16 + } + } + }, + "title": { + "type": "text" + } + } + } +} +``` +{% include copy-curl.html %} + +When using an S3 file as the source for asynchronous batch ingestion, you must map the fields in the source file to fields in the index in order to indicate into which index each piece of data is ingested. If no JSON path is provided for a field, that field will be set to `null` in the k-NN index. + +In the `field_map`, indicate the location of the data for each field in the source file. You can also specify fields to be ingested directly into your index without making any changes to the source file by adding their JSON paths to the `ingest_fields` array. For example, in the following asynchronous batch ingestion request, the element with the JSON path `$.id` from the source file is ingested directly into the `id` field of your index. To ingest this data from the Amazon S3 file, send the following request to your OpenSearch endpoint: + +```json +POST /_plugins/_ml/_batch_ingestion +{ + "index_name": "my-nlp-index", + "field_map": { + "chapter": "$.content[0]", + "title": "$.content[1]", + "chapter_embedding": "$.SageMakerOutput[0]", + "title_embedding": "$.SageMakerOutput[1]", + "_id": "$.id" + }, + "ingest_fields": ["$.id"], + "credential": { + "region": "us-east-1", + "access_key": "", + "secret_key": "", + "session_token": "" + }, + "data_source": { + "type": "s3", + "source": ["s3://offlinebatch/output/sagemaker_batch.json.out"] + } +} +``` +{% include copy-curl.html %} + +The response contains a task ID for the ingestion task: + +```json +{ + "task_id": "cbsPlpEBMHcagzGbOQOx", + "task_type": "BATCH_INGEST", + "status": "CREATED" +} +``` + +To check the status of the operation, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/). Once ingestion is complete, the task `state` changes to `COMPLETED`. + + +## Ingesting data from multiple files + +You can also ingest data from multiple files by specifying the file locations in the `source`. The following example ingests data from three OpenAI files. + +The OpenAI Batch API input file is formatted as follows: + +``` +{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": [ "What is the meaning of life?", "The food was delicious and the waiter..."]}} +{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": [ "What is the meaning of work?", "The travel was fantastic and the view..."]}} +{"custom_id": "request-3", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": [ "What is the meaning of friend?", "The old friend was far away and the time..."]}} +... +``` + +The OpenAI Batch API output file is formatted as follows: + +``` +{"id": "batch_req_ITKQn29igorXCAGp6wzYs5IS", "custom_id": "request-1", "response": {"status_code": 200, "request_id": "10845755592510080d13054c3776aef4", "body": {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [0.0044326545, ... ...]}, {"object": "embedding", "index": 1, "embedding": [0.002297497, ... ... ]}], "model": "text-embedding-ada-002", "usage": {"prompt_tokens": 15, "total_tokens": 15}}}, "error": null} +... +``` + +If you have run the Batch API in OpenAI for text embedding and want to ingest the model input and output files along with some metadata into your index, send the following asynchronous ingestion request. Make sure to use `source[file-index]` to identify the file's location in the source array in the request body. For example, `source[0]` refers to the first file in the `data_source.source` array. + +The following request ingests seven fields into your index: Five are specified in the `field_map` section and two are specified in `ingest_fields`. The format follows the pattern `sourcefile.jsonPath`, indicating the JSON path for each file. In the field_map, `$.body.input[0]` is used as the JSON path to ingest data into the `question` field from the second file in the `source` array. The `ingest_fields` array lists all elements from the `source` files that will be ingested directly into your index: + +```json +POST /_plugins/_ml/_batch_ingestion +{ + "index_name": "my-nlp-index-openai", + "field_map": { + "question": "source[1].$.body.input[0]", + "answer": "source[1].$.body.input[1]", + "question_embedding":"source[0].$.response.body.data[0].embedding", + "answer_embedding":"source[0].$.response.body.data[1].embedding", + "_id": ["source[0].$.custom_id", "source[1].$.custom_id"] + }, + "ingest_fields": ["source[2].$.custom_field1", "source[2].$.custom_field2"], + "credential": { + "openAI_key": "" + }, + "data_source": { + "type": "openAI", + "source": ["file-", "file-", "file-"] + } +} +``` +{% include copy-curl.html %} + +In the request, make sure to define the `_id` field in the `field_map`. This is necessary in order to map each data entry from the three separate files. + +The response contains a task ID for the ingestion task: + +```json +{ + "task_id": "cbsPlpEBMHcagzGbOQOx", + "task_type": "BATCH_INGEST", + "status": "CREATED" +} +``` + +To check the status of the operation, provide the task ID to the [Tasks API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/tasks-apis/get-task/). Once ingestion is complete, the task `state` changes to `COMPLETED`. + +For request field descriptions, see [Asynchronous Batch Ingestion API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api/async-batch-ingest/). \ No newline at end of file diff --git a/_ml-commons-plugin/remote-models/blueprints.md b/_ml-commons-plugin/remote-models/blueprints.md index 254a21b068c..9b95c311661 100644 --- a/_ml-commons-plugin/remote-models/blueprints.md +++ b/_ml-commons-plugin/remote-models/blueprints.md @@ -55,19 +55,20 @@ As an ML developer, you can build connector blueprints for other platforms. Usin ## Configuration parameters -| Field | Data type | Is required | Description | -|:---|:---|:---|:---| -| `name` | String | Yes | The name of the connector. | -| `description` | String | Yes | A description of the connector. | -| `version` | Integer | Yes | The version of the connector. | -| `protocol` | String | Yes | The protocol for the connection. For AWS services such as Amazon SageMaker and Amazon Bedrock, use `aws_sigv4`. For all other services, use `http`. | -| `parameters` | JSON object | Yes | The default connector parameters, including `endpoint` and `model`. Any parameters indicated in this field can be overridden by parameters specified in a predict request. | -| `credential` | JSON object | Yes | Defines any credential variables required to connect to your chosen endpoint. ML Commons uses **AES/GCM/NoPadding** symmetric encryption to encrypt your credentials. When the connection to the cluster first starts, OpenSearch creates a random 32-byte encryption key that persists in OpenSearch's system index. Therefore, you do not need to manually set the encryption key. | -| `actions` | JSON array | Yes | Defines what actions can run within the connector. If you're an administrator creating a connection, add the [blueprint]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/) for your desired connection. | -| `backend_roles` | JSON array | Yes | A list of OpenSearch backend roles. For more information about setting up backend roles, see [Assigning backend roles to users]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#assigning-backend-roles-to-users). | -| `access_mode` | String | Yes | Sets the access mode for the model, either `public`, `restricted`, or `private`. Default is `private`. For more information about `access_mode`, see [Model groups]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#model-groups). | -| `add_all_backend_roles` | Boolean | Yes | When set to `true`, adds all `backend_roles` to the access list, which only a user with admin permissions can adjust. When set to `false`, non-admins can add `backend_roles`. | -| `client_config` | JSON object | No | The client configuration object, which provides settings that control the behavior of the client connections used by the connector. These settings allow you to manage connection limits and timeouts, ensuring efficient and reliable communication. | +| Field | Data type | Is required | Description | +|:-------------------------------------------------|:---|:------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `name` | String | Yes | The name of the connector. | +| `description` | String | Yes | A description of the connector. | +| `version` | Integer | Yes | The connector version. | +| `protocol` | String | Yes | The protocol for the connection. For AWS services, such as Amazon SageMaker and Amazon Bedrock, use `aws_sigv4`. For all other services, use `http`. | +| `parameters` | JSON object | Yes | The default connector parameters, including `endpoint`, `model`, and `skip_validating_missing_parameters`. Any parameters indicated in this field can be overridden by parameters specified in a predict request. | +| `credential` | JSON object | Yes | Defines any credential variables required for connecting to your chosen endpoint. ML Commons uses **AES/GCM/NoPadding** symmetric encryption to encrypt your credentials. When the cluster connection is initiated, OpenSearch creates a random 32-byte encryption key that persists in OpenSearch's system index. Therefore, you do not need to manually set the encryption key. | +| `actions` | JSON array | Yes | Defines the actions that can run within the connector. If you're an administrator creating a connection, add the [blueprint]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/blueprints/) for your desired connection. | +| `backend_roles` | JSON array | Yes | A list of OpenSearch backend roles. For more information about setting up backend roles, see [Assigning backend roles to users]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#assigning-backend-roles-to-users). | +| `access_mode` | String | Yes | Sets the access mode for the model, either `public`, `restricted`, or `private`. Default is `private`. For more information about `access_mode`, see [Model groups]({{site.url}}{{site.baseurl}}/ml-commons-plugin/model-access-control#model-groups). | +| `add_all_backend_roles` | Boolean | Yes | When set to `true`, adds all `backend_roles` to the access list, which only a user with admin permissions can adjust. When set to `false`, non-admins can add `backend_roles`. | +| `client_config` | JSON object | No | The client configuration object, which provides settings that control the behavior of the client connections used by the connector. These settings allow you to manage connection limits and timeouts, ensuring efficient and reliable communication. | +| `parameters.skip_validating_missing_parameters` | Boolean | No | When set to `true`, this option allows you to send a request using a connector without validating any missing parameters. Default is `false`. | The `actions` parameter supports the following options. @@ -76,12 +77,11 @@ The `actions` parameter supports the following options. |:---|:---|:---| | `action_type` | String | Required. Sets the ML Commons API operation to use upon connection. As of OpenSearch 2.9, only `predict` is supported. | | `method` | String | Required. Defines the HTTP method for the API call. Supports `POST` and `GET`. | -| `url` | String | Required. Sets the connection endpoint at which the action occurs. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index#adding-trusted-endpoints). | -| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. | +| `url` | String | Required. Specifies the connection endpoint at which the action occurs. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/remote-models/index#adding-trusted-endpoints).| | `request_body` | String | Required. Sets the parameters contained in the request body of the action. The parameters must include `\"inputText\`, which specifies how users of the connector should construct the request payload for the `action_type`. | | `pre_process_function` | String | Optional. A built-in or custom Painless script used to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models
- `connector.pre_process.openai.embedding` for [OpenAI](https://platform.openai.com/docs/guides/embeddings) embedding models
- `connector.pre_process.default.embedding`, which you can use to preprocess documents in neural search requests so that they are in the format that ML Commons can process with the default preprocessor (OpenSearch 2.11 or later). For more information, see [Built-in functions](#built-in-pre--and-post-processing-functions). | | `post_process_function` | String | Optional. A built-in or custom Painless script used to post-process the model output data. OpenSearch provides the following built-in post-process functions that you can call directly:
- `connector.pre_process.cohere.embedding` for [Cohere text embedding models](https://docs.cohere.com/reference/embed)
- `connector.pre_process.openai.embedding` for [OpenAI text embedding models](https://platform.openai.com/docs/api-reference/embeddings)
- `connector.post_process.default.embedding`, which you can use to post-process documents in the model response so that they are in the format that neural search expects (OpenSearch 2.11 or later). For more information, see [Built-in functions](#built-in-pre--and-post-processing-functions). | - +| `headers` | JSON object | Specifies the headers used in the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. | The `client_config` parameter supports the following options. diff --git a/_ml-commons-plugin/tutorials/semantic-search-byte-vectors.md b/_ml-commons-plugin/tutorials/semantic-search-byte-vectors.md index 7061d3cb5ac..c4cc27f6609 100644 --- a/_ml-commons-plugin/tutorials/semantic-search-byte-vectors.md +++ b/_ml-commons-plugin/tutorials/semantic-search-byte-vectors.md @@ -7,7 +7,7 @@ nav_order: 10 # Semantic search using byte-quantized vectors -This tutorial illustrates how to build a semantic search using the [Cohere Embed model](https://docs.cohere.com/reference/embed) and byte-quantized vectors. For more information about using byte-quantized vectors, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#lucene-byte-vector). +This tutorial shows you how to build a semantic search using the [Cohere Embed model](https://docs.cohere.com/reference/embed) and byte-quantized vectors. For more information about using byte-quantized vectors, see [Byte vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#byte-vectors). The Cohere Embed v3 model supports several `embedding_types`. For this tutorial, you'll use the `INT8` type to encode byte-quantized vectors. diff --git a/_monitoring-your-cluster/pa/index.md b/_monitoring-your-cluster/pa/index.md index bb4f9c6c302..156e985e8bf 100644 --- a/_monitoring-your-cluster/pa/index.md +++ b/_monitoring-your-cluster/pa/index.md @@ -60,7 +60,7 @@ private-key-file-path = specify_path The Performance Analyzer plugin is included in the installations for [Docker]({{site.url}}{{site.baseurl}}/opensearch/install/docker/) and [tarball]({{site.url}}{{site.baseurl}}/opensearch/install/tar/), but you can also install the plugin manually. -To install the Performance Analyzer plugin manually, download the plugin from [Maven](https://search.maven.org/search?q=org.opensearch.plugin) and install it using the standard [plugin installation]({{site.url}}{{site.baseurl}}/opensearch/install/plugins/) process. Performance Analyzer runs on each node in a cluster. +To install the Performance Analyzer plugin manually, download the plugin from [Maven](https://central.sonatype.com/namespace/org.opensearch.plugin) and install it using the standard [plugin installation]({{site.url}}{{site.baseurl}}/opensearch/install/plugins/) process. Performance Analyzer runs on each node in a cluster. To start the Performance Analyzer root cause analysis (RCA) agent on a tarball installation, run the following command: diff --git a/_observing-your-data/ad/dashboards-anomaly-detection.md b/_observing-your-data/ad/dashboards-anomaly-detection.md index 679237094a0..ad6fa5950b3 100644 --- a/_observing-your-data/ad/dashboards-anomaly-detection.md +++ b/_observing-your-data/ad/dashboards-anomaly-detection.md @@ -18,12 +18,12 @@ You can connect data visualizations to OpenSearch datasets and then create, run, Before getting started, you must have: - Installed OpenSearch and OpenSearch Dashboards version 2.9 or later. See [Installing OpenSearch]({{site.url}}{{site.baseurl}}/install-and-configure/install-opensearch/index/). -- Installed the Anomaly Detection plugin version 2.9 or later. See [Installing OpenSearch plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins). +- Installed the Anomaly Detection plugin version 2.9 or later. See [Installing OpenSearch plugins]/({{site.url}}{{site.baseurl}}/install-and-configure/plugins/). - Installed the Anomaly Detection Dashboards plugin version 2.9 or later. See [Managing OpenSearch Dashboards plugins]({{site.url}}{{site.baseurl}}/install-and-configure/install-dashboards/plugins/) to get started. ## General requirements for anomaly detection visualizations -Anomaly detection visualizations are displayed as time-series charts that give you a snapshot of when anomalies have occurred from different anomaly detectors you have configured for the visualization. You can display up to 10 metrics on your chart, and each series can be shown as a line on the chart. Note that only real-time anomalies will be visible on the chart. For more information on real-time and historical anomaly detection, see [Anomaly detection, Step 3: Set up detector jobs]({{site.url}}{{site.baseurl}}/observing-your-data/ad/index/#step-3-set-up-detector-jobs). +Anomaly detection visualizations are displayed as time-series charts that give you a snapshot of when anomalies have occurred from different anomaly detectors you have configured for the visualization. You can display up to 10 metrics on your chart, and each series can be shown as a line on the chart. Note that only real-time anomalies will be visible on the chart. For more information about real-time and historical anomaly detection, see [Anomaly detection, Step 3: Set up detector jobs]({{site.url}}{{site.baseurl}}/observing-your-data/ad/index/#step-3-setting-up-detector-jobs). Keep in mind the following requirements when setting up or creating anomaly detection visualizations. The visualization: diff --git a/_observing-your-data/ad/index.md b/_observing-your-data/ad/index.md index 5dfa1b8f1a7..657c3c90cb4 100644 --- a/_observing-your-data/ad/index.md +++ b/_observing-your-data/ad/index.md @@ -10,30 +10,42 @@ redirect_from: # Anomaly detection -An anomaly in OpenSearch is any unusual behavior change in your time-series data. Anomalies can provide valuable insights into your data. For example, for IT infrastructure data, an anomaly in the memory usage metric might help you uncover early signs of a system failure. +An _anomaly_ in OpenSearch is any unusual behavior change in your time-series data. Anomalies can provide valuable insights into your data. For example, for IT infrastructure data, an anomaly in the memory usage metric can help identify early signs of a system failure. -It can be challenging to discover anomalies using conventional methods such as creating visualizations and dashboards. You could configure an alert based on a static threshold, but this requires prior domain knowledge and isn't adaptive to data that exhibits organic growth or seasonal behavior. +Conventional techniques like visualizations and dashboards can make it difficult to uncover anomalies. Configuring alerts based on static thresholds is possible, but this approach requires prior domain knowledge and may not adapt to data with organic growth or seasonal trends. -Anomaly detection automatically detects anomalies in your OpenSearch data in near real-time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an `anomaly grade` and `confidence score` value for each incoming data point. These values are used to differentiate an anomaly from normal variations. For more information about how RCF works, see [Random Cut Forests](https://www.semanticscholar.org/paper/Robust-Random-Cut-Forest-Based-Anomaly-Detection-on-Guha-Mishra/ecb365ef9b67cd5540cc4c53035a6a7bd88678f9). +Anomaly detection automatically detects anomalies in your OpenSearch data in near real time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an _anomaly grade_ and _confidence score_ value for each incoming data point. These values are used to differentiate an anomaly from normal variations. For more information about how RCF works, see [Robust Random Cut Forest Based Anomaly Detection on Streams](https://www.semanticscholar.org/paper/Robust-Random-Cut-Forest-Based-Anomaly-Detection-on-Guha-Mishra/ecb365ef9b67cd5540cc4c53035a6a7bd88678f9). You can pair the Anomaly Detection plugin with the [Alerting plugin]({{site.url}}{{site.baseurl}}/monitoring-plugins/alerting/) to notify you as soon as an anomaly is detected. +{: .note} -To get started, choose **Anomaly Detection** in OpenSearch Dashboards. -To first test with sample streaming data, you can try out one of the preconfigured detectors with one of the sample datasets. +## Getting started with anomaly detection in OpenSearch Dashboards + +To get started, go to **OpenSearch Dashboards** > **OpenSearch Plugins** > **Anomaly Detection**. ## Step 1: Define a detector -A detector is an individual anomaly detection task. You can define multiple detectors, and all the detectors can run simultaneously, with each analyzing data from different sources. +A _detector_ is an individual anomaly detection task. You can define multiple detectors, and all detectors can run simultaneously, with each analyzing data from different sources. You can define a detector by following these steps: + +1. On the **Anomaly detection** page, select the **Create detector** button. +2. On the **Define detector** page, enter the required information in the **Detector details** pane. +3. In the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or an alias. +4. (Optional) Filter the data source by selecting **Add data filter** and then entering the conditions for **Field**, **Operator**, and **Value**. Alternatively, you can choose **Use query DSL** and add your JSON filter query. Only [Boolean queries]({{site.url}}{{site.baseurl}}/query-dsl/compound/bool/) are supported for query domain-specific language (DSL). +#### Example: Filtering data using query DSL + +The following example query retrieves documents in which the `urlPath.keyword` field matches any of the specified values: +======= 1. Choose **Create detector**. 1. Add in the detector details. - Enter a name and brief description. Make sure the name is unique and descriptive enough to help you to identify the purpose of the detector. 1. Specify the data source. - - For **Data source**, choose the index you want to use as the data source. You can optionally use index patterns to choose multiple indexes. + - For **Data source**, choose one or more indexes to use as the data source. Alternatively, you can use an alias or index pattern to choose multiple indexes. + - Detectors can use remote indexes. You can access them using the `cluster-name:index-name` pattern. See [Cross-cluster search]({{site.url}}{{site.baseurl}}/search-plugins/cross-cluster-search/) for more information. Alternatively, you can select clusters and indexes in OpenSearch Dashboards 2.17 or later. To learn about configuring remote indexes with the Security plugin enabled, see [Selecting remote indexes with fine-grained access control]({{site.url}}{{site.baseurl}}/observing-your-data/ad/security/#selecting-remote-indexes-with-fine-grained-access-control) in the [Anomaly detection security](observing-your-data/ad/security/) documentation. - (Optional) For **Data filter**, filter the index you chose as the data source. From the **Data filter** menu, choose **Add data filter**, and then design your filter query by selecting **Field**, **Operator**, and **Value**, or choose **Use query DSL** and add your own JSON filter query. Only [Boolean queries]({{site.url}}{{site.baseurl}}/query-dsl/compound/bool/) are supported for query domain-specific language (DSL). -#### Example filter using query DSL -The query is designed to retrieve documents in which the `urlPath.keyword` field matches one of the following specified values: +To create a cross-cluster detector in OpenSearch Dashboards, the following [permissions]({{site.url}}{{site.baseurl}}/security/access-control/permissions/) are required: `indices:data/read/field_caps`, `indices:admin/resolve/index`, and `cluster:monitor/remote/info`. +{: .note} - /domain/{id}/short - /sub_dir/{id}/short @@ -62,40 +74,38 @@ The query is designed to retrieve documents in which the `urlPath.keyword` field } } ``` + {% include copy-curl.html %} -1. Specify a timestamp. - - Select the **Timestamp field** in your index. -1. Define operation settings. - - For **Operation settings**, define the **Detector interval**, which is the time interval at which the detector collects data. - - The detector aggregates the data in this interval, then feeds the aggregated result into the anomaly detection model. - The shorter you set this interval, the fewer data points the detector aggregates. - The anomaly detection model uses a shingling process, a technique that uses consecutive data points to create a sample for the model. This process needs a certain number of aggregated data points from contiguous intervals. - - - We recommend setting the detector interval based on your actual data. If it's too long it might delay the results, and if it's too short it might miss some data. It also won't have a sufficient number of consecutive data points for the shingle process. +5. In the **Timestamp** pane, select a field from the **Timestamp field** dropdown menu. - - (Optional) To add extra processing time for data collection, specify a **Window delay** value. +6. In the **Operation settings** pane, define the **Detector interval**, which is the interval at which the detector collects data. + - The detector aggregates the data at this interval and then feeds the aggregated result into the anomaly detection model. The shorter the interval, the fewer data points the detector aggregates. The anomaly detection model uses a shingling process, a technique that uses consecutive data points to create a sample for the model. This process requires a certain number of aggregated data points from contiguous intervals. + - You should set the detector interval based on your actual data. If the detector interval is too long, then it might delay the results. If the detector interval is too short, then it might miss some data. The detector interval also will not have a sufficient number of consecutive data points for the shingle process. + - (Optional) To add extra processing time for data collection, specify a **Window delay** value. - This value tells the detector that the data is not ingested into OpenSearch in real time but with a certain delay. Set the window delay to shift the detector interval to account for this delay. - - For example, say the detector interval is 10 minutes and data is ingested into your cluster with a general delay of 1 minute. Assume the detector runs at 2:00. The detector attempts to get the last 10 minutes of data from 1:50 to 2:00, but because of the 1-minute delay, it only gets 9 minutes of data and misses the data from 1:59 to 2:00. Setting the window delay to 1 minute shifts the interval window to 1:49--1:59, so the detector accounts for all 10 minutes of the detector interval time. -1. Specify custom results index. - - The Anomaly Detection plugin allows you to store anomaly detection results in a custom index of your choice. To enable this, select **Enable custom results index** and provide a name for your index, for example, `abc`. The plugin then creates an alias prefixed with `opensearch-ad-plugin-result-` followed by your chosen name, for example, `opensearch-ad-plugin-result-abc`. This alias points to an actual index with a name containing the date and a sequence number, like `opensearch-ad-plugin-result-abc-history-2024.06.12-000002`, where your results are stored. + - For example, the detector interval is 10 minutes and data is ingested into your cluster with a general delay of 1 minute. Assume the detector runs at 2:00. The detector attempts to get the last 10 minutes of data from 1:50 to 2:00, but because of the 1-minute delay, it only gets 9 minutes of data and misses the data from 1:59 to 2:00. Setting the window delay to 1 minute shifts the interval window to 1:49--1:59, so the detector accounts for all 10 minutes of the detector interval time. + - To avoid missing any data, set the **Window delay** to the upper limit of the expected ingestion delay. This ensures that the detector captures all data during its interval, reducing the risk of missing relevant information. While a longer window delay helps capture all data, too long of a window delay can hinder real-time anomaly detection because the detector will look further back in time. Find a balance to maintain both data accuracy and timely detection. - You can use the dash “-” sign to separate the namespace to manage custom results index permissions. For example, if you use `opensearch-ad-plugin-result-financial-us-group1` as the results index, you can create a permission role based on the pattern `opensearch-ad-plugin-result-financial-us-*` to represent the "financial" department at a granular level for the "us" area. +7. Specify a custom results index. + - The Anomaly Detection plugin allows you to store anomaly detection results in a custom index of your choice. Select **Enable custom results index** and provide a name for your index, for example, `abc`. The plugin then creates an alias prefixed with `opensearch-ad-plugin-result-` followed by your chosen name, for example, `opensearch-ad-plugin-result-abc`. This alias points to an actual index with a name containing the date and a sequence number, such as `opensearch-ad-plugin-result-abc-history-2024.06.12-000002`, where your results are stored. + + You can use `-` to separate the namespace to manage custom results index permissions. For example, if you use `opensearch-ad-plugin-result-financial-us-group1` as the results index, you can create a permission role based on the pattern `opensearch-ad-plugin-result-financial-us-*` to represent the `financial` department at a granular level for the `us` group. {: .note } - When the Security plugin (fine-grained access control) is enabled, the default results index becomes a system index and is no longer accessible through the standard Index or Search APIs. To access its content, you must use the Anomaly Detection RESTful API or the dashboard. As a result, you cannot build customized dashboards using the default results index if the Security plugin is enabled. However, you can create a custom results index in order to build customized dashboards. - If the custom index you specify does not exist, the Anomaly Detection plugin will create it when you create the detector and start your real-time or historical analysis. - If the custom index already exists, the plugin will verify that the index mapping matches the required structure for anomaly results. In this case, ensure that the custom index has a valid mapping as defined in the [`anomaly-results.json`](https://github.com/opensearch-project/anomaly-detection/blob/main/src/main/resources/mappings/anomaly-results.json) file. - - To use the custom results index option, you need the following permissions: - - `indices:admin/create` - The Anomaly Detection plugin requires the ability to create and roll over the custom index. - - `indices:admin/aliases` - The Anomaly Detection plugin requires access to create and manage an alias for the custom index. - - `indices:data/write/index` - You need the `write` permission for the Anomaly Detection plugin to write results into the custom index for a single-entity detector. - - `indices:data/read/search` - You need the `search` permission because the Anomaly Detection plugin needs to search custom results indexes to show results on the Anomaly Detection UI. - - `indices:data/write/delete` - Because the detector might generate a large number of anomaly results, you need the `delete` permission to delete old data and save disk space. - - `indices:data/write/bulk*` - You need the `bulk*` permission because the Anomaly Detection plugin uses the bulk API to write results into the custom index. - - Managing the custom results index: - - The anomaly detection dashboard queries all detectors’ results from all custom results indexes. Having too many custom results indexes might impact the performance of the Anomaly Detection plugin. - - You can use [Index State Management]({{site.url}}{{site.baseurl}}/im-plugin/ism/index/) to rollover old results indexes. You can also manually delete or archive any old results indexes. We recommend reusing a custom results index for multiple detectors. - - The Anomaly Detection plugin also provides lifecycle management for custom indexes. It rolls an alias over to a new index when the custom results index meets any of the conditions in the following table. + - To use the custom results index option, you must have the following permissions: + - `indices:admin/create` -- The `create` permission is required in order to create and roll over the custom index. + - `indices:admin/aliases` -- The `aliases` permission is required in order to create and manage an alias for the custom index. + - `indices:data/write/index` -- The `write` permission is required in order to write results into the custom index for a single-entity detector. + - `indices:data/read/search` -- The `search` permission is required in order to search custom results indexes to show results on the Anomaly Detection interface. + - `indices:data/write/delete` -- The detector may generate many anomaly results. The `delete` permission is required in order to delete old data and save disk space. + - `indices:data/write/bulk*` -- The `bulk*` permission is required because the plugin uses the Bulk API to write results into the custom index. + - When managing the custom results index, consider the following: + - The anomaly detection dashboard queries all detector results from all custom results indexes. Having too many custom results indexes can impact the plugin's performance. + - You can use [Index State Management]({{site.url}}{{site.baseurl}}/im-plugin/ism/index/) to roll over old results indexes. You can also manually delete or archive any old results indexes. Reusing a custom results index for multiple detectors is recommended. + - The plugin provides lifecycle management for custom indexes. It rolls over an alias to a new index when the custom results index meets any of the conditions in the following table. Parameter | Description | Type | Unit | Example | Required :--- | :--- |:--- |:--- |:--- |:--- @@ -103,43 +113,52 @@ The query is designed to retrieve documents in which the `urlPath.keyword` field `result_index_min_age` | The minimum index age required for rollover, calculated from its creation time to the current time. | `integer` |`day` | `7` | No `result_index_ttl` | The minimum age required to permanently delete rolled-over indexes. | `integer` | `day` | `60` | No -1. Choose **Next**. +8. Choose **Next**. After you define the detector, the next step is to configure the model. ## Step 2: Configure the model -#### Add features to your detector +1. Add features to your detector. -A feature is the field in your index that you want to check for anomalies. A detector can discover anomalies across one or more features. You must choose an aggregation method for each feature: `average()`, `count()`, `sum()`, `min()`, or `max()`. The aggregation method determines what constitutes an anomaly. +A _feature_ is any field in your index that you want to analyze for anomalies. A detector can discover anomalies across one or more features. You must choose an aggregation method for each feature: `average()`, `count()`, `sum()`, `min()`, or `max()`. The aggregation method determines what constitutes an anomaly. For example, if you choose `min()`, the detector focuses on finding anomalies based on the minimum values of your feature. If you choose `average()`, the detector finds anomalies based on the average values of your feature. -A multi-feature model correlates anomalies across all its features. The [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) makes it less likely for multi-feature models to identify smaller anomalies as compared to a single-feature model. Adding more features might negatively impact the [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) of a model. A higher proportion of noise in your data might further amplify this negative impact. Selecting the optimal feature set is usually an iterative process. By default, the maximum number of features for a detector is 5. You can adjust this limit with the `plugins.anomaly_detection.max_anomaly_features` setting. -{: .note } +A multi-feature model correlates anomalies across all its features. The [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) makes it less likely that multi-feature models will identify smaller anomalies as compared to a single-feature model. Adding more features can negatively impact the [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) of a model. A higher proportion of noise in your data can further amplify this negative impact. Selecting the optimal feature set is usually an iterative process. By default, the maximum number of features for a detector is `5`. You can adjust this limit using the `plugins.anomaly_detection.max_anomaly_features` setting. +{: .note} + +### Configuring a model based on an aggregation method To configure an anomaly detection model based on an aggregation method, follow these steps: -1. On the **Configure Model** page, enter the **Feature name** and check **Enable feature**. -1. For **Find anomalies based on**, select **Field Value**. -1. For **aggregation method**, select either **average()**, **count()**, **sum()**, **min()**, or **max()**. -1. For **Field**, select from the available options. +1. On the **Detectors** page, select the desired detector from the list. +2. On the detector's details page, select the **Actions** button to activate the dropdown menu and then select **Edit model configuration**. +3. On the **Edit model configuration** page, select the **Add another feature** button. +4. Enter a name in the **Feature name** field and select the **Enable feature** checkbox. +5. Select **Field value** from the dropdown menu under **Find anomalies based on**. +6. Select the desired aggregation from the dropdown menu under **Aggregation method**. +7. Select the desired field from the options listed in the dropdown menu under **Field**. +8. Select the **Save changes** button. + +### Configuring a model based on a JSON aggregation query To configure an anomaly detection model based on a JSON aggregation query, follow these steps: -1. On the **Configure Model** page, enter the **Feature name** and check **Enable feature**. -1. For **Find anomalies based on**, select **Custom expression**. You will see the JSON editor window open up. -1. Enter your JSON aggregation query in the editor. -For acceptable JSON query syntax, see [OpenSearch Query DSL]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/index/) -{: .note } +1. On the **Edit model configuration** page, select the **Add another feature** button. +2. Enter a name in the **Feature name** field and select the **Enable feature** checkbox. +3. Select **Custom expression** from the dropdown menu under **Find anomalies based on**. The JSON editor window will open. +4. Enter your JSON aggregation query in the editor. +5. Select the **Save changes** button. -#### (Optional) Set category fields for high cardinality +For acceptable JSON query syntax, see [OpenSearch Query DSL]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/index/). +{: .note} -You can categorize anomalies based on a keyword or IP field type. +### Setting categorical fields for high cardinality -The category field categorizes or slices the source time series with a dimension like IP addresses, product IDs, country codes, and so on. This helps to see a granular view of anomalies within each entity of the category field to isolate and debug issues. +You can categorize anomalies based on a keyword or IP field type. You can enable the **Categorical fields** option to categorize, or "slice," the source time series using a dimension, such as an IP address, a product ID, or a country code. This gives you a granular view of anomalies within each entity of the category field to help isolate and debug issues. -To set a category field, choose **Enable a category field** and select a field. You can’t change the category fields after you create the detector. +To set a category field, choose **Enable categorical fields** and select a field. You cannot change the category fields after you create the detector. Only a certain number of unique entities are supported in the category field. Use the following equation to calculate the recommended total number of entities supported in a cluster: @@ -147,7 +166,7 @@ Only a certain number of unique entities are supported in the category field. Us (data nodes * heap size * anomaly detection maximum memory percentage) / (entity model size of a detector) ``` -To get the entity model size of a detector, use the [profile detector API]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/api/#profile-detector). You can adjust the maximum memory percentage with the `plugins.anomaly_detection.model_max_size_percent` setting. +To get the detector's entity model size, use the [Profile Detector API]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/api/#profile-detector). You can adjust the maximum memory percentage using the `plugins.anomaly_detection.model_max_size_percent` setting. Consider a cluster with 3 data nodes, each with 8 GB of JVM heap size and the default 10% memory allocation. With an entity model size of 1 MB, the following formula calculates the estimated number of unique entities: @@ -155,81 +174,109 @@ Consider a cluster with 3 data nodes, each with 8 GB of JVM heap size and the de (8096 MB * 0.1 / 1 MB ) * 3 = 2429 ``` -If the actual total number of unique entities is higher than the number that you calculate (in this case, 2,429), the anomaly detector will attempt to model the extra entities. The detector prioritizes entities that occur more often and are more recent. +If the actual total number of unique entities is higher than the number that you calculate (in this case, 2,429), then the anomaly detector attempts to model the extra entities. The detector prioritizes both entities that occur more often and are more recent. -This formula serves as a starting point. Make sure to test it with a representative workload. You can find more information in the [Improving Anomaly Detection: One million entities in one minute](https://opensearch.org/blog/one-million-enitities-in-one-minute/) blog post. +This formula serves as a starting point. Make sure to test it with a representative workload. See the OpenSearch blog post [Improving Anomaly Detection: One million entities in one minute](https://opensearch.org/blog/one-million-enitities-in-one-minute/) for more information. {: .note } -#### (Advanced settings) Set a shingle size +### Setting a shingle size -Set the number of aggregation intervals from your data stream to consider in a detection window. It’s best to choose this value based on your actual data to see which one leads to the best results for your use case. +In the **Advanced settings** pane, you can set the number of data stream aggregation intervals to include in the detection window. Choose this value based on your actual data to find the optimal setting for your use case. To set the shingle size, select **Show** in the **Advanced settings** pane. Enter the desired size in the **intervals** field. -The anomaly detector expects the shingle size to be in the range of 1 and 60. The default shingle size is 8. We recommend that you don't choose 1 unless you have two or more features. Smaller values might increase [recall](https://en.wikipedia.org/wiki/Precision_and_recall) but also false positives. Larger values might be useful for ignoring noise in a signal. +The anomaly detector requires the shingle size to be between 1 and 128. The default is `8`. Use `1` only if you have at least two features. Values of less than `8` may increase [recall](https://en.wikipedia.org/wiki/Precision_and_recall) but also may increase false positives. Values greater than `8` may be useful for ignoring noise in a signal. -#### Preview sample anomalies +### Setting an imputation option -Preview sample anomalies and adjust the feature settings if needed. -For sample previews, the Anomaly Detection plugin selects a small number of data samples---for example, one data point every 30 minutes---and uses interpolation to estimate the remaining data points to approximate the actual feature data. It loads this sample dataset into the detector. The detector uses this sample dataset to generate a sample preview of anomaly results. +In the **Advanced settings** pane, you can set the imputation option. This allows you to manage missing data in your streams. The options include the following: -Examine the sample preview and use it to fine-tune your feature configurations (for example, enable or disable features) to get more accurate results. +- **Ignore Missing Data (Default):** The system continues without considering missing data points, keeping the existing data flow. +- **Fill with Custom Values:** Specify a custom value for each feature to replace missing data points, allowing for targeted imputation tailored to your data. +- **Fill with Zeros:** Replace missing values with zeros. This is ideal when the absence of data indicates a significant event, such as a drop to zero in event counts. +- **Use Previous Values:** Fill gaps with the last observed value to maintain continuity in your time-series data. This method treats missing data as non-anomalous, carrying forward the previous trend. -1. Choose **Preview sample anomalies**. - - If you don't see any sample anomaly result, check the detector interval and make sure you have more than 400 data points for some entities during the preview date range. -1. Choose **Next**. +Using these options can improve recall in anomaly detection. For instance, if you are monitoring for drops in event counts, including both partial and complete drops, then filling missing values with zeros helps detect significant data absences, improving detection recall. + +Be cautious when imputing extensively missing data, as excessive gaps can compromise model accuracy. Quality input is critical---poor data quality leads to poor model performance. The confidence score also decreases when imputations occur. You can check whether a feature value has been imputed using the `feature_imputed` field in the anomaly results index. See [Anomaly result mapping]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/result-mapping/) for more information. +{: note} + +### Suppressing anomalies with threshold-based rules + +In the **Advanced settings** pane, you can suppress anomalies by setting rules that define acceptable differences between the expected and actual values, either as an absolute value or a relative percentage. This helps reduce false anomalies caused by minor fluctuations, allowing you to focus on significant deviations. + +Suppose you want to detect substantial changes in log volume while ignoring small variations that are not meaningful. Without customized settings, the system might generate false alerts for minor changes, making it difficult to identify true anomalies. By setting suppression rules, you can ignore minor deviations and focus on real anomalous patterns. + +To suppress anomalies for deviations of less than 30% from the expected value, you can set the following rules: -## Step 3: Set up detector jobs +``` +Ignore anomalies for feature logVolume when the actual value is no more than 30% above the expected value. +Ignore anomalies for feature logVolume when the actual value is no more than 30% below the expected value. +``` + +Ensure that a feature, for example, `logVolume`, is properly defined in your model. Suppression rules are tied to specific features. +{: .note} + +If you expect that the log volume should differ by at least 10,000 from the expected value before being considered an anomaly, you can set absolute thresholds: -To start a real-time detector to find anomalies in your data in near real-time, check **Start real-time detector automatically (recommended)**. +``` +Ignore anomalies for feature logVolume when the actual value is no more than 10000 above the expected value. +Ignore anomalies for feature logVolume when the actual value is no more than 10000 below the expected value. +``` -Alternatively, if you want to perform historical analysis and find patterns in long historical data windows (weeks or months), check **Run historical analysis detection** and select a date range (at least 128 detection intervals). +If no custom suppression rules are set, then the system defaults to a filter that ignores anomalies with deviations of less than 20% from the expected value for each enabled feature. -Analyzing historical data helps you get familiar with the Anomaly Detection plugin. You can also evaluate the performance of a detector with historical data to further fine-tune it. +### Previewing sample anomalies -We recommend experimenting with historical analysis with different feature sets and checking the precision before moving on to real-time detectors. +You can preview anomalies based on sample feature input and adjust the feature settings as needed. The Anomaly Detection plugin selects a small number of data samples---for example, 1 data point every 30 minutes---and uses interpolation to estimate the remaining data points to approximate the actual feature data. The sample dataset is loaded into the detector, which then uses the sample dataset to generate a preview of the anomalies. + +1. Choose **Preview sample anomalies**. + - If sample anomaly results are not displayed, check the detector interval to verify that 400 or more data points are set for the entities during the preview date range. +2. Select the **Next** button. -## Step 4: Review and create +## Step 3: Setting up detector jobs -Review your detector settings and model configurations to make sure that they're valid and then select **Create detector**. +To start a detector to find anomalies in your data in near real time, select **Start real-time detector automatically (recommended)**. -![Anomaly detection results]({{site.url}}{{site.baseurl}}/images/review_ad.png) +Alternatively, if you want to perform historical analysis and find patterns in longer historical data windows (weeks or months), select the **Run historical analysis detection** box and select a date range of at least 128 detection intervals. -If you see any validation errors, edit the settings to fix the errors and then return back to this page. +Analyzing historical data can help to familiarize you with the Anomaly Detection plugin. For example, you can evaluate the performance of a detector against historical data in order to fine-tune it. + +You can experiment with historical analysis by using different feature sets and checking the precision before using real-time detectors. + +## Step 4: Reviewing detector settings + +Review your detector settings and model configurations to confirm that they are valid and then select **Create detector**. + +If a validation error occurs, edit the settings to correct the error and return to the detector page. {: .note } -## Step 5: Observe the results +## Step 5: Observing the results -Choose the **Real-time results** or **Historical analysis** tab. For real-time results, you need to wait for some time to see the anomaly results. If the detector interval is 10 minutes, the detector might take more than an hour to start, because its waiting for sufficient data to generate anomalies. +Choose either the **Real-time results** or **Historical analysis** tab. For real-time results, it will take some time to display the anomaly results. For example, if the detector interval is 10 minutes, then the detector may take an hour to initiate because it is waiting for sufficient data to be able to generate anomalies. -A shorter interval means the model passes the shingle process more quickly and starts to generate the anomaly results sooner. -Use the [profile detector]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/api#profile-detector) operation to make sure you have sufficient data points. +A shorter interval results in the model passing the shingle process more quickly and generating anomaly results sooner. You can use the [profile detector]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/api#profile-detector) operation to ensure that you have enough data points. -If you see the detector pending in "initialization" for longer than a day, aggregate your existing data using the detector interval to check for any missing data points. If you find a lot of missing data points from the aggregated data, consider increasing the detector interval. +If the detector is pending in "initialization" for longer than 1 day, aggregate your existing data and use the detector interval to check for any missing data points. If you find many missing data points, consider increasing the detector interval. -Choose and drag over the anomaly line chart to zoom in and see a more detailed view of an anomaly. +Click and drag over the anomaly line chart to zoom in and see a detailed view of an anomaly. {: .note } -Analyze anomalies with the following visualizations: +You can analyze anomalies using the following visualizations: -- **Live anomalies** (for real-time results) displays live anomaly results for the last 60 intervals. For example, if the interval is 10, it shows results for the last 600 minutes. The chart refreshes every 30 seconds. -- **Anomaly overview** (for real-time results) / **Anomaly history** (for historical analysis in the **Historical analysis** tab) plots the anomaly grade with the corresponding measure of confidence. This pane includes: +- **Live anomalies** (for real-time results) displays live anomaly results for the last 60 intervals. For example, if the interval is `10`, it shows results for the last 600 minutes. The chart refreshes every 30 seconds. +- **Anomaly overview** (for real-time results) or **Anomaly history** (for historical analysis on the **Historical analysis** tab) plot the anomaly grade with the corresponding measure of confidence. The pane includes: - The number of anomaly occurrences based on the given data-time range. - - The **Average anomaly grade**, a number between 0 and 1 that indicates how anomalous a data point is. An anomaly grade of 0 represents “not an anomaly,” and a non-zero value represents the relative severity of the anomaly. + - The **Average anomaly grade**, a number between 0 and 1 that indicates how anomalous a data point is. An anomaly grade of `0` represents "not an anomaly," and a non-zero value represents the relative severity of the anomaly. - **Confidence** estimate of the probability that the reported anomaly grade matches the expected anomaly grade. Confidence increases as the model observes more data and learns the data behavior and trends. Note that confidence is distinct from model accuracy. - **Last anomaly occurrence** is the time at which the last anomaly occurred. -Underneath **Anomaly overview**/**Anomaly history** are: +Underneath **Anomaly overview** or **Anomaly history** are: - **Feature breakdown** plots the features based on the aggregation method. You can vary the date-time range of the detector. Selecting a point on the feature line chart shows the **Feature output**, the number of times a field appears in your index, and the **Expected value**, a predicted value for the feature output. Where there is no anomaly, the output and expected values are equal. - ![Anomaly detection results]({{site.url}}{{site.baseurl}}/images/feature-contribution-ad.png) - - **Anomaly occurrences** shows the `Start time`, `End time`, `Data confidence`, and `Anomaly grade` for each detected anomaly. Selecting a point on the anomaly line chart shows **Feature Contribution**, the percentage of a feature that contributes to the anomaly -![Anomaly detection results]({{site.url}}{{site.baseurl}}/images/feature-contribution-ad.png) - - If you set the category field, you see an additional **Heat map** chart. The heat map correlates results for anomalous entities. This chart is empty until you select an anomalous entity. You also see the anomaly and feature line chart for the time period of the anomaly (`anomaly_grade` > 0). @@ -249,7 +296,7 @@ To see all the configuration settings for a detector, choose the **Detector conf 1. To make any changes to the detector configuration, or fine tune the time interval to minimize any false positives, go to the **Detector configuration** section and choose **Edit**. - You need to stop real-time and historical analysis to change its configuration. Confirm that you want to stop the detector and proceed. -1. To enable or disable features, in the **Features** section, choose **Edit** and adjust the feature settings as needed. After you make your changes, choose **Save and start detector**. +2. To enable or disable features, in the **Features** section, choose **Edit** and adjust the feature settings as needed. After you make your changes, choose **Save and start detector**. ## Step 8: Manage your detectors diff --git a/_observing-your-data/ad/result-mapping.md b/_observing-your-data/ad/result-mapping.md index 7e1482a0134..967b1856843 100644 --- a/_observing-your-data/ad/result-mapping.md +++ b/_observing-your-data/ad/result-mapping.md @@ -9,9 +9,7 @@ redirect_from: # Anomaly result mapping -If you enabled custom result index, the anomaly detection plugin stores the results in your own index. - -If the anomaly detector doesn’t detect an anomaly, the result has the following format: +When you select the **Enable custom result index** box on the **Custom result index** pane, the Anomaly Detection plugin will save the results to an index of your choosing. When the anomaly detector does not detect an anomaly, the result format is as follows: ```json { @@ -61,6 +59,7 @@ If the anomaly detector doesn’t detect an anomaly, the result has the followin "threshold": 1.2368549346675202 } ``` +{% include copy-curl.html %} ## Response body fields @@ -80,7 +79,83 @@ Field | Description `model_id` | A unique ID that identifies a model. If a detector is a single-stream detector (with no category field), it has only one model. If a detector is a high-cardinality detector (with one or more category fields), it might have multiple models, one for each entity. `threshold` | One of the criteria for a detector to classify a data point as an anomaly is that its `anomaly_score` must surpass a dynamic threshold. This field records the current threshold. -If an anomaly detector detects an anomaly, the result has the following format: +When the imputation option is enabled, the anomaly results include a `feature_imputed` array showing which features were modified due to missing data. If no features were imputed, then this is excluded. + +In the following example anomaly result output, the `processing_bytes_max` feature was imputed, as shown by the `imputed: true` status: + +```json +{ + "detector_id": "kzcZ43wBgEQAbjDnhzGF", + "schema_version": 5, + "data_start_time": 1635898161367, + "data_end_time": 1635898221367, + "feature_data": [ + { + "feature_id": "processing_bytes_max", + "feature_name": "processing bytes max", + "data": 2322 + }, + { + "feature_id": "processing_bytes_avg", + "feature_name": "processing bytes avg", + "data": 1718.6666666666667 + }, + { + "feature_id": "processing_bytes_min", + "feature_name": "processing bytes min", + "data": 1375 + }, + { + "feature_id": "processing_bytes_sum", + "feature_name": "processing bytes sum", + "data": 5156 + }, + { + "feature_id": "processing_time_max", + "feature_name": "processing time max", + "data": 31198 + } + ], + "execution_start_time": 1635898231577, + "execution_end_time": 1635898231622, + "anomaly_score": 1.8124904404395776, + "anomaly_grade": 0, + "confidence": 0.9802940756605277, + "entity": [ + { + "name": "process_name", + "value": "process_3" + } + ], + "model_id": "kzcZ43wBgEQAbjDnhzGF_entity_process_3", + "threshold": 1.2368549346675202, + "feature_imputed": [ + { + "feature_id": "processing_bytes_max", + "imputed": true + }, + { + "feature_id": "processing_bytes_avg", + "imputed": false + }, + { + "feature_id": "processing_bytes_min", + "imputed": false + }, + { + "feature_id": "processing_bytes_sum", + "imputed": false + }, + { + "feature_id": "processing_time_max", + "imputed": false + } + ] +} +``` +{% include copy-curl.html %} + +When an anomaly is detected, the result is provided in the following format: ```json { @@ -179,24 +254,23 @@ If an anomaly detector detects an anomaly, the result has the following format: "execution_start_time": 1635898427803 } ``` +{% include copy-curl.html %} -You can see the following additional fields: +Note that the result includes the following additional field. Field | Description :--- | :--- `relevant_attribution` | Represents the contribution of each input variable. The sum of the attributions is normalized to 1. `expected_values` | The expected value for each feature. -At times, the detector might detect an anomaly late. -Let's say the detector sees a random mix of the triples {1, 2, 3} and {2, 4, 5} that correspond to `slow weeks` and `busy weeks`, respectively. For example 1, 2, 3, 1, 2, 3, 2, 4, 5, 1, 2, 3, 2, 4, 5, ... and so on. -If the detector comes across a pattern {2, 2, X} and it's yet to see X, the detector infers that the pattern is anomalous, but it can't determine at this point which of the 2's is the cause. If X = 3, then the detector knows it's the first 2 in that unfinished triple, and if X = 5, then it's the second 2. If it's the first 2, then the detector detects the anomaly late. +The detector may be late in detecting an anomaly. For example: The detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, then the detector infers that the pattern is anomalous. However, it cannot determine which 2 is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector will be late in detecting the anomaly. -If a detector detects an anomaly late, the result has the following additional fields: +When a detector is late in detecting an anomaly, the result includes the following additional fields. Field | Description :--- | :--- -`past_values` | The actual input that triggered an anomaly. If `past_values` is null, the attributions or expected values are from the current input. If `past_values` is not null, the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]). -`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors don't query previous anomaly results because these queries are expensive operations. The cost is especially high for high-cardinality detectors that might have a lot of entities. If the data is not continuous, the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier. +`past_values` | The actual input that triggered an anomaly. If `past_values` is `null`, then the attributions or expected values are from the current input. If `past_values` is not `null`, then the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]). +`approx_anomaly_start_time` | The approximate time of the actual input that triggered an anomaly. This field helps you understand the time at which a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time at which the detector detects an anomaly can be earlier. ```json { @@ -319,3 +393,4 @@ Field | Description "approx_anomaly_start_time": 1635883620000 } ``` +{% include copy-curl.html %} diff --git a/_observing-your-data/ad/security.md b/_observing-your-data/ad/security.md index 8eeaa3df41a..e4816cec468 100644 --- a/_observing-your-data/ad/security.md +++ b/_observing-your-data/ad/security.md @@ -23,6 +23,11 @@ As an admin user, you can use the Security plugin to assign specific permissions The Security plugin has two built-in roles that cover most anomaly detection use cases: `anomaly_full_access` and `anomaly_read_access`. For descriptions of each, see [Predefined roles]({{site.url}}{{site.baseurl}}/security/access-control/users-roles#predefined-roles). +If you use OpenSearch Dashboards to create your anomaly detectors, you may experience access issues even with `anomaly_full_access`. This issue has been resolved in OpenSearch 2.17, but for earlier versions, the following additional permissions need to be added: + +- `indices:data/read/search` -- You need this permission because the Anomaly Detection plugin needs to search the data source in order to validate whether there is enough data to train the model. +- `indices:admin/mappings/fields/get` and `indices:admin/mappings/fields/get*` -- You need these permissions to validate whether the given data source has a valid timestamp field and categorical field (in the case of creating a high-cardinality detector). + If these roles don't meet your needs, mix and match individual anomaly detection [permissions]({{site.url}}{{site.baseurl}}/security/access-control/permissions/) to suit your use case. Each action corresponds to an operation in the REST API. For example, the `cluster:admin/opensearch/ad/detector/delete` permission lets you delete detectors. ### A note on alerts and fine-grained access control @@ -31,6 +36,42 @@ When a trigger generates an alert, the detector and monitor configurations, the To reduce the chances of unintended users viewing metadata that could describe an index, we recommend that administrators enable role-based access control and keep these kinds of design elements in mind when assigning permissions to the intended group of users. See [Limit access by backend role](#advanced-limit-access-by-backend-role) for details. +### Selecting remote indexes with fine-grained access control + +To use a remote index as a data source for a detector, see the setup steps in [Authentication flow]({{site.url}}{{site.baseurl}}/search-plugins/cross-cluster-search/#authentication-flow) in [Cross-cluster search]({{site.url}}{{site.baseurl}}/search-plugins/cross-cluster-search/). You must use a role that exists in both the remote and local clusters. The remote cluster must map the chosen role to the same username as in the local cluster. + +--- + +#### Example: Create a new user on the local cluster + +1. Create a new user on the local cluster to use for detector creation: + +``` +curl -XPUT -k -u 'admin:' 'https://localhost:9200/_plugins/_security/api/internalusers/anomalyuser' -H 'Content-Type: application/json' -d '{"password":"password"}' +``` +{% include copy-curl.html %} + +2. Map the new user to the `anomaly_full_access` role: + +``` +curl -XPUT -k -u 'admin:' -H 'Content-Type: application/json' 'https://localhost:9200/_plugins/_security/api/rolesmapping/anomaly_full_access' -d '{"users" : ["anomalyuser"]}' +``` +{% include copy-curl.html %} + +3. On the remote cluster, create the same user and map `anomaly_full_access` to that role: + +``` +curl -XPUT -k -u 'admin:' 'https://localhost:9250/_plugins/_security/api/internalusers/anomalyuser' -H 'Content-Type: application/json' -d '{"password":"password"}' +curl -XPUT -k -u 'admin:' -H 'Content-Type: application/json' 'https://localhost:9250/_plugins/_security/api/rolesmapping/anomaly_full_access' -d '{"users" : ["anomalyuser"]}' +``` +{% include copy-curl.html %} + +--- + +### Custom results index + +To use a custom results index, you need additional permissions not included in the default roles provided by the OpenSearch Security plugin. To add these permissions, see [Step 1: Define a detector]({{site.url}}{{site.baseurl}}/observing-your-data/ad/index/#step-1-define-a-detector) in the [Anomaly detection]({{site.url}}{{site.baseurl}}/observing-your-data/ad/index/) documentation. + ## (Advanced) Limit access by backend role Use backend roles to configure fine-grained access to individual detectors based on roles. For example, users of different departments in an organization can view detectors owned by their own department. diff --git a/_observing-your-data/query-insights/grouping-top-n-queries.md b/_observing-your-data/query-insights/grouping-top-n-queries.md new file mode 100644 index 00000000000..28cbcbb8e55 --- /dev/null +++ b/_observing-your-data/query-insights/grouping-top-n-queries.md @@ -0,0 +1,331 @@ +--- +layout: default +title: Grouping top N queries +parent: Query insights +nav_order: 20 +--- + +# Grouping top N queries +**Introduced 2.17** +{: .label .label-purple } + +Monitoring the [top N queries]({{site.url}}{{site.baseurl}}/observing-your-data/query-insights/top-n-queries/) can help you to identify the most resource-intensive queries based on latency, CPU, and memory usage in a specified time window. However, if a single computationally expensive query is executed multiple times, it can occupy all top N query slots, potentially preventing other expensive queries from appearing in the list. To address this issue, you can group similar queries, gaining insight into various high-impact query groups. + +Starting with OpenSearch version 2.17, the top N queries can be grouped by `similarity`, with additional grouping options planned for future version releases. + +## Grouping queries by similarity + +Grouping queries by `similarity` organizes them based on the query structure, removing everything except the core query operations. + +For example, the following query: + +```json +{ + "query": { + "bool": { + "must": [ + { "exists": { "field": "field1" } } + ], + "query_string": { + "query": "search query" + } + } + } +} +``` + +Has the following corresponding query structure: + +```c +bool + must + exists + query_string +``` + +When queries share the same query structure, they are grouped together, ensuring that all similar queries belong to the same group. + + +## Aggregate metrics per group + +In addition to retrieving latency, CPU, and memory metrics for individual top N queries, you can obtain aggregate statistics for the +top N query groups. For each query group, the response includes the following statistics: +- The total latency, CPU usage, or memory usage (depending on the configured metric type) +- The total query count + +Using these statistics, you can calculate the average latency, CPU usage, or memory usage for each query group. +The response also includes one example query from the query group. + +## Configuring query grouping + +Before you enable query grouping, you must enable top N query monitoring for a metric type of your choice. For more information, see [Configuring top N query monitoring]({{site.url}}{{site.baseurl}}/observing-your-data/query-insights/top-n-queries/#configuring-top-n-query-monitoring). + +To configure grouping for top N queries, use the following steps. + +### Step 1: Enable top N query monitoring + +Ensure that top N query monitoring is enabled for at least one of the metrics: latency, CPU, or memory. For more information, see [Configuring top N query monitoring]({{site.url}}{{site.baseurl}}/observing-your-data/query-insights/top-n-queries/#configuring-top-n-query-monitoring). + +For example, to enable top N query monitoring by latency with the default settings, send the following request: + +```json +PUT _cluster/settings +{ + "persistent" : { + "search.insights.top_queries.latency.enabled" : true + } +} +``` +{% include copy-curl.html %} + +### Step 2: Configure query grouping + +Set the desired grouping method by updating the following cluster setting: + +```json +PUT _cluster/settings +{ + "persistent" : { + "search.insights.top_queries.group_by" : "similarity" + } +} +``` +{% include copy-curl.html %} + +The default value for the `group_by` setting is `none`, which disables grouping. As of OpenSearch 2.17, the supported values for `group_by` are `similarity` and `none`. + +### Step 3 (Optional): Limit the number of monitored query groups + +Optionally, you can limit the number of monitored query groups. Queries already included in the top N query list (the most resource-intensive queries) will not be considered in determining the limit. Essentially, the maximum applies only to other query groups, and the top N queries are tracked separately. This helps manage the tracking of query groups based on workload and query window size. + +To limit tracking to 100 query groups, send the following request: + +```json +PUT _cluster/settings +{ + "persistent" : { + "search.insights.top_queries.max_groups_excluding_topn" : 100 + } +} +``` +{% include copy-curl.html %} + +The default value for `max_groups_excluding_topn` is `100`, and you can set it to any value between `0` and `10,000`, inclusive. + +## Monitoring query groups + +To view the top N query groups, send the following request: + +```json +GET /_insights/top_queries +``` +{% include copy-curl.html %} + +The response contains the top N query groups: + +
+ + Response + + {: .text-delta} + +```json +{ + "top_queries": [ + { + "timestamp": 1725495127359, + "source": { + "query": { + "match_all": { + "boost": 1.0 + } + } + }, + "phase_latency_map": { + "expand": 0, + "query": 55, + "fetch": 3 + }, + "total_shards": 1, + "node_id": "ZbINz1KFS1OPeFmN-n5rdg", + "query_hashcode": "b4c4f69290df756021ca6276be5cbb75", + "task_resource_usages": [ + { + "action": "indices:data/read/search[phase/query]", + "taskId": 30, + "parentTaskId": 29, + "nodeId": "ZbINz1KFS1OPeFmN-n5rdg", + "taskResourceUsage": { + "cpu_time_in_nanos": 33249000, + "memory_in_bytes": 2896848 + } + }, + { + "action": "indices:data/read/search", + "taskId": 29, + "parentTaskId": -1, + "nodeId": "ZbINz1KFS1OPeFmN-n5rdg", + "taskResourceUsage": { + "cpu_time_in_nanos": 3151000, + "memory_in_bytes": 133936 + } + } + ], + "indices": [ + "my_index" + ], + "labels": {}, + "search_type": "query_then_fetch", + "measurements": { + "latency": { + "number": 160, + "count": 10, + "aggregationType": "AVERAGE" + } + } + }, + { + "timestamp": 1725495135160, + "source": { + "query": { + "term": { + "content": { + "value": "first", + "boost": 1.0 + } + } + } + }, + "phase_latency_map": { + "expand": 0, + "query": 18, + "fetch": 0 + }, + "total_shards": 1, + "node_id": "ZbINz1KFS1OPeFmN-n5rdg", + "query_hashcode": "c3620cc3d4df30fb3f95aeb2167289a4", + "task_resource_usages": [ + { + "action": "indices:data/read/search[phase/query]", + "taskId": 50, + "parentTaskId": 49, + "nodeId": "ZbINz1KFS1OPeFmN-n5rdg", + "taskResourceUsage": { + "cpu_time_in_nanos": 10188000, + "memory_in_bytes": 288136 + } + }, + { + "action": "indices:data/read/search", + "taskId": 49, + "parentTaskId": -1, + "nodeId": "ZbINz1KFS1OPeFmN-n5rdg", + "taskResourceUsage": { + "cpu_time_in_nanos": 262000, + "memory_in_bytes": 3216 + } + } + ], + "indices": [ + "my_index" + ], + "labels": {}, + "search_type": "query_then_fetch", + "measurements": { + "latency": { + "number": 109, + "count": 7, + "aggregationType": "AVERAGE" + } + } + }, + { + "timestamp": 1725495139766, + "source": { + "query": { + "match": { + "content": { + "query": "first", + "operator": "OR", + "prefix_length": 0, + "max_expansions": 50, + "fuzzy_transpositions": true, + "lenient": false, + "zero_terms_query": "NONE", + "auto_generate_synonyms_phrase_query": true, + "boost": 1.0 + } + } + } + }, + "phase_latency_map": { + "expand": 0, + "query": 15, + "fetch": 0 + }, + "total_shards": 1, + "node_id": "ZbINz1KFS1OPeFmN-n5rdg", + "query_hashcode": "484eaabecd13db65216b9e2ff5eee999", + "task_resource_usages": [ + { + "action": "indices:data/read/search[phase/query]", + "taskId": 64, + "parentTaskId": 63, + "nodeId": "ZbINz1KFS1OPeFmN-n5rdg", + "taskResourceUsage": { + "cpu_time_in_nanos": 12161000, + "memory_in_bytes": 473456 + } + }, + { + "action": "indices:data/read/search", + "taskId": 63, + "parentTaskId": -1, + "nodeId": "ZbINz1KFS1OPeFmN-n5rdg", + "taskResourceUsage": { + "cpu_time_in_nanos": 293000, + "memory_in_bytes": 3216 + } + } + ], + "indices": [ + "my_index" + ], + "labels": {}, + "search_type": "query_then_fetch", + "measurements": { + "latency": { + "number": 43, + "count": 3, + "aggregationType": "AVERAGE" + } + } + } + ] +} +``` + +
+ +## Response fields + +The response includes the following fields. + +Field | Data type | Description +:--- |:---| :--- +`top_queries` | Array | The list of top query groups. +`top_queries.timestamp` | Integer | The execution timestamp for the first query in the query group. +`top_queries.source` | Object | The first query in the query group. +`top_queries.phase_latency_map` | Object | The phase latency map for the first query in the query group. The map includes the amount of time, in milliseconds, that the query spent in the `expand`, `query`, and `fetch` phases. +`top_queries.total_shards` | Integer | The number of shards on which the first query was executed. +`top_queries.node_id` | String | The node ID of the node that coordinated the execution of the first query in the query group. +`top_queries.query_hashcode` | String | The hash code that uniquely identifies the query group. This is essentially the hash of the [query structure](#grouping-queries-by-similarity). +`top_queries.task_resource_usages` | Array of objects | The resource usage breakdown for the various tasks belonging to the first query in the query group. +`top_queries.indices` | Array | The indexes searched by the first query in the query group. +`top_queries.labels` | Object | Used to label the top query. +`top_queries.search_type` | String | The search request execution type (`query_then_fetch` or `dfs_query_then_fetch`). For more information, see the `search_type` parameter in the [Search API documentation]({{site.url}}{{site.baseurl}}/api-reference/search/#url-parameters). +`top_queries.measurements` | Object | The aggregate measurements for the query group. +`top_queries.measurements.latency` | Object | The aggregate latency measurements for the query group. +`top_queries.measurements.latency.number` | Integer | The total latency for the query group. +`top_queries.measurements.latency.count` | Integer | The number of queries in the query group. +`top_queries.measurements.latency.aggregationType` | String | The aggregation type for the current entry. If grouping by similarity is enabled, then `aggregationType` is `AVERAGE`. If it is not enabled, then `aggregationType` is `NONE`. \ No newline at end of file diff --git a/_observing-your-data/query-insights/index.md b/_observing-your-data/query-insights/index.md index 549371240f4..ef3a65bfcdf 100644 --- a/_observing-your-data/query-insights/index.md +++ b/_observing-your-data/query-insights/index.md @@ -7,8 +7,10 @@ has_toc: false --- # Query insights +**Introduced 2.12** +{: .label .label-purple } -To monitor and analyze the search queries within your OpenSearch clusterQuery information, you can obtain query insights. With minimal performance impact, query insights features aim to provide comprehensive insights into search query execution, enabling you to better understand search query characteristics, patterns, and system behavior during query execution stages. Query insights facilitate enhanced detection, diagnosis, and prevention of query performance issues, ultimately improving query processing performance, user experience, and overall system resilience. +To monitor and analyze the search queries within your OpenSearch cluster, you can obtain query insights. With minimal performance impact, query insights features aim to provide comprehensive insights into search query execution, enabling you to better understand search query characteristics, patterns, and system behavior during query execution stages. Query insights facilitate enhanced detection, diagnosis, and prevention of query performance issues, ultimately improving query processing performance, user experience, and overall system resilience. Typical use cases for query insights features include the following: @@ -36,4 +38,5 @@ For information about installing plugins, see [Installing plugins]({{site.url}}{ You can obtain the following information using Query Insights: - [Top n queries]({{site.url}}{{site.baseurl}}/observing-your-data/query-insights/top-n-queries/) +- [Grouping top N queries]({{site.url}}{{site.baseurl}}/observing-your-data/query-insights/grouping-top-n-queries/) - [Query metrics]({{site.url}}{{site.baseurl}}/observing-your-data/query-insights/query-metrics/) diff --git a/_observing-your-data/query-insights/query-metrics.md b/_observing-your-data/query-insights/query-metrics.md index c8caf21d657..beac8d4e185 100644 --- a/_observing-your-data/query-insights/query-metrics.md +++ b/_observing-your-data/query-insights/query-metrics.md @@ -2,10 +2,12 @@ layout: default title: Query metrics parent: Query insights -nav_order: 20 +nav_order: 30 --- # Query metrics +**Introduced 2.16** +{: .label .label-purple } Key query [metrics](#metrics), such as aggregation types, query types, latency, and resource usage per query type, are captured along the search path by using the OpenTelemetry (OTel) instrumentation framework. The telemetry data can be consumed using OTel metrics [exporters]({{site.url}}{{site.baseurl}}/observing-your-data/trace/distributed-tracing/#exporters). diff --git a/_observing-your-data/query-insights/top-n-queries.md b/_observing-your-data/query-insights/top-n-queries.md index f07fd2dfef7..b63d670926b 100644 --- a/_observing-your-data/query-insights/top-n-queries.md +++ b/_observing-your-data/query-insights/top-n-queries.md @@ -7,7 +7,7 @@ nav_order: 10 # Top N queries -Monitoring the top N queries in query insights features can help you gain real-time insights into the top queries with high latency within a certain time frame (for example, the last hour). +Monitoring the top N queries using query insights allows you to gain real-time visibility into the queries with the highest latency or resource consumption in a specified time period (for example, the last hour). ## Configuring top N query monitoring @@ -72,14 +72,14 @@ PUT _cluster/settings ## Monitoring the top N queries -You can use the Insights API endpoint to obtain the top N queries for all metric types: +You can use the Insights API endpoint to retrieve the top N queries. This API returns top N `latency` results by default. ```json GET /_insights/top_queries ``` {% include copy-curl.html %} -Specify a metric type to filter the response: +Specify the `type` parameter to retrieve the top N results for other metric types. The results will be sorted in descending order based on the specified metric type. ```json GET /_insights/top_queries?type=latency @@ -96,6 +96,9 @@ GET /_insights/top_queries?type=memory ``` {% include copy-curl.html %} +If your query returns no results, ensure that top N query monitoring is enabled for the target metric type and that search requests were made within the current [time window](#configuring-the-window-size). +{: .important} + ## Exporting top N query data You can configure your desired exporter to export top N query data to different sinks, allowing for better monitoring and analysis of your OpenSearch queries. Currently, the following exporters are supported: diff --git a/_query-dsl/geo-and-xy/geo-bounding-box.md b/_query-dsl/geo-and-xy/geo-bounding-box.md index 1112a4278e8..66fcc224d61 100644 --- a/_query-dsl/geo-and-xy/geo-bounding-box.md +++ b/_query-dsl/geo-and-xy/geo-bounding-box.md @@ -173,11 +173,11 @@ GET testindex1/_search ``` {% include copy-curl.html %} -## Request fields +## Parameters -Geo-bounding box queries accept the following fields. +Geo-bounding box queries accept the following parameters. -Field | Data type | Description +Parameter | Data type | Description :--- | :--- | :--- `_name` | String | The name of the filter. Optional. `validation_method` | String | The validation method. Valid values are `IGNORE_MALFORMED` (accept geopoints with invalid coordinates), `COERCE` (try to coerce coordinates to valid values), and `STRICT` (return an error when coordinates are invalid). Default is `STRICT`. diff --git a/_query-dsl/geo-and-xy/geodistance.md b/_query-dsl/geo-and-xy/geodistance.md index b272cad81ee..3eef58bc699 100644 --- a/_query-dsl/geo-and-xy/geodistance.md +++ b/_query-dsl/geo-and-xy/geodistance.md @@ -103,11 +103,11 @@ The response contains the matching document: } ``` -## Request fields +## Parameters -Geodistance queries accept the following fields. +Geodistance queries accept the following parameters. -Field | Data type | Description +Parameter | Data type | Description :--- | :--- | :--- `_name` | String | The name of the filter. Optional. `distance` | String | The distance within which to match the points. This distance is the radius of a circle centered at the specified point. For supported distance units, see [Distance units]({{site.url}}{{site.baseurl}}/api-reference/common-parameters/#distance-units). Required. diff --git a/_query-dsl/geo-and-xy/geopolygon.md b/_query-dsl/geo-and-xy/geopolygon.md index 980a0c5a631..810e48f2b7f 100644 --- a/_query-dsl/geo-and-xy/geopolygon.md +++ b/_query-dsl/geo-and-xy/geopolygon.md @@ -161,11 +161,11 @@ However, if you specify the vertices in the following order: The response returns no results. -## Request fields +## Parameters -Geopolygon queries accept the following fields. +Geopolygon queries accept the following parameters. -Field | Data type | Description +Parameter | Data type | Description :--- | :--- | :--- `_name` | String | The name of the filter. Optional. `validation_method` | String | The validation method. Valid values are `IGNORE_MALFORMED` (accept geopoints with invalid coordinates), `COERCE` (try to coerce coordinates to valid values), and `STRICT` (return an error when coordinates are invalid). Optional. Default is `STRICT`. diff --git a/_query-dsl/geo-and-xy/geoshape.md b/_query-dsl/geo-and-xy/geoshape.md index 42948666f4f..5b144b06d6c 100644 --- a/_query-dsl/geo-and-xy/geoshape.md +++ b/_query-dsl/geo-and-xy/geoshape.md @@ -25,15 +25,15 @@ Relation | Description | Supporting geographic field type ## Defining the shape in a geoshape query -You can define the shape to filter documents in a geoshape query either by providing a new shape definition at query time or by referencing the name of a shape pre-indexed in another index. +You can define the shape to filter documents in a geoshape query either by [providing a new shape definition at query time](#using-a-new-shape-definition) or by [referencing the name of a shape pre-indexed in another index](#using-a-pre-indexed-shape-definition). -### Using a new shape definition +## Using a new shape definition To provide a new shape to a geoshape query, define it in the `geo_shape` field. You must define the geoshape in [GeoJSON format](https://geojson.org/). The following example illustrates searching for documents containing geoshapes that match a geoshape defined at query time. -#### Step 1: Create an index +### Step 1: Create an index First, create an index and map the `location` field as a `geo_shape`: @@ -422,7 +422,7 @@ GET /testindex/_search Geoshape queries whose geometry collection contains a linestring or a multilinestring do not support the `WITHIN` relation. {: .note} -### Using a pre-indexed shape definition +## Using a pre-indexed shape definition When constructing a geoshape query, you can also reference the name of a shape pre-indexed in another index. Using this method, you can define a geoshape at index time and refer to it by name at search time. @@ -721,10 +721,10 @@ The response returns document 1: Note that when you indexed the geopoints, you specified their coordinates in `"latitude, longitude"` format. When you search for matching documents, the coordinate array is in `[longitude, latitude]` format. Thus, document 1 is returned in the results but document 2 is not. -## Request fields +## Parameters -Geoshape queries accept the following fields. +Geoshape queries accept the following parameters. -Field | Data type | Description +Parameter | Data type | Description :--- | :--- | :--- `ignore_unmapped` | Boolean | Specifies whether to ignore an unmapped field. If set to `true`, then the query does not return any documents that contain an unmapped field. If set to `false`, then an exception is thrown when the field is unmapped. Optional. Default is `false`. \ No newline at end of file diff --git a/_query-dsl/joining/has-child.md b/_query-dsl/joining/has-child.md new file mode 100644 index 00000000000..a6b67ea8caf --- /dev/null +++ b/_query-dsl/joining/has-child.md @@ -0,0 +1,398 @@ +--- +layout: default +title: Has child +parent: Joining queries +nav_order: 10 +--- + +# Has child query + +The `has_child` query returns parent documents whose child documents match a specific query. You can establish parent-child relationships between documents in the same index by using a [join]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/join/) field type. + +The `has_child` query is slower than other queries because of the join operation it performs. Performance decreases as the number of matching child documents pointing to different parent documents increases. Each `has_child` query in your search may significantly impact query performance. If you prioritize speed, avoid using this query or limit its usage as much as possible. +{: .warning} + +## Example + +Before you can run a `has_child` query, your index must contain a [join]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/join/) field in order to establish parent-child relationships. The index mapping request uses the following format: + +```json +PUT /example_index +{ + "mappings": { + "properties": { + "relationship_field": { + "type": "join", + "relations": { + "parent_doc": "child_doc" + } + } + } + } +} +``` +{% include copy-curl.html %} + +In this example, you'll configure an index that contains documents representing products and their brands. + +First, create the index and establish the parent-child relationship between `brand` and `product`: + +```json +PUT testindex1 +{ + "mappings": { + "properties": { + "product_to_brand": { + "type": "join", + "relations": { + "brand": "product" + } + } + } + } +} +``` +{% include copy-curl.html %} + +Index two parent (brand) documents: + +```json +PUT testindex1/_doc/1 +{ + "name": "Luxury brand", + "product_to_brand" : "brand" +} +``` +{% include copy-curl.html %} + +```json +PUT testindex1/_doc/2 +{ + "name": "Economy brand", + "product_to_brand" : "brand" +} +``` +{% include copy-curl.html %} + +Index three child (product) documents: + +```json +PUT testindex1/_doc/3?routing=1 +{ + "name": "Mechanical watch", + "sales_count": 150, + "product_to_brand": { + "name": "product", + "parent": "1" + } +} +``` +{% include copy-curl.html %} + +```json +PUT testindex1/_doc/4?routing=2 +{ + "name": "Electronic watch", + "sales_count": 300, + "product_to_brand": { + "name": "product", + "parent": "2" + } +} +``` +{% include copy-curl.html %} + +```json +PUT testindex1/_doc/5?routing=2 +{ + "name": "Digital watch", + "sales_count": 100, + "product_to_brand": { + "name": "product", + "parent": "2" + } +} +``` +{% include copy-curl.html %} + +To search for the parent of a child, use a `has_child` query. The following query returns parent documents (brands) that make watches: + +```json +GET testindex1/_search +{ + "query" : { + "has_child": { + "type":"product", + "query": { + "match" : { + "name": "watch" + } + } + } + } +} +``` +{% include copy-curl.html %} + +The response returns both brands: + +```json +{ + "took": 15, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "testindex1", + "_id": "1", + "_score": 1, + "_source": { + "name": "Luxury brand", + "product_to_brand": "brand" + } + }, + { + "_index": "testindex1", + "_id": "2", + "_score": 1, + "_source": { + "name": "Economy brand", + "product_to_brand": "brand" + } + } + ] + } +} +``` + +## Retrieving inner hits + +To return child documents that matched the query, provide the `inner_hits` parameter: + +```json +GET testindex1/_search +{ + "query" : { + "has_child": { + "type":"product", + "query": { + "match" : { + "name": "watch" + } + }, + "inner_hits": {} + } + } +} +``` +{% include copy-curl.html %} + +The response contains child documents in the `inner_hits` field: + +```json +{ + "took": 52, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "testindex1", + "_id": "1", + "_score": 1, + "_source": { + "name": "Luxury brand", + "product_to_brand": "brand" + }, + "inner_hits": { + "product": { + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 0.53899646, + "hits": [ + { + "_index": "testindex1", + "_id": "3", + "_score": 0.53899646, + "_routing": "1", + "_source": { + "name": "Mechanical watch", + "sales_count": 150, + "product_to_brand": { + "name": "product", + "parent": "1" + } + } + } + ] + } + } + } + }, + { + "_index": "testindex1", + "_id": "2", + "_score": 1, + "_source": { + "name": "Economy brand", + "product_to_brand": "brand" + }, + "inner_hits": { + "product": { + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 0.53899646, + "hits": [ + { + "_index": "testindex1", + "_id": "4", + "_score": 0.53899646, + "_routing": "2", + "_source": { + "name": "Electronic watch", + "sales_count": 300, + "product_to_brand": { + "name": "product", + "parent": "2" + } + } + }, + { + "_index": "testindex1", + "_id": "5", + "_score": 0.53899646, + "_routing": "2", + "_source": { + "name": "Digital watch", + "sales_count": 100, + "product_to_brand": { + "name": "product", + "parent": "2" + } + } + } + ] + } + } + } + } + ] + } +} +``` + +For more information about retrieving inner hits, see [Inner hits]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/inner-hits/). + +## Parameters + +The following table lists all top-level parameters supported by `has_child` queries. + +| Parameter | Required/Optional | Description | +|:---|:---|:---| +| `type` | Required | Specifies the name of the child relationship as defined in the `join` field mapping. | +| `query` | Required | The query to run on child documents. If a child document matches the query, the parent document is returned. | +| `ignore_unmapped` | Optional | Indicates whether to ignore unmapped `type` fields and not return documents instead of throwing an error. You can provide this parameter when querying multiple indexes, some of which may not contain the `type` field. Default is `false`. | +| `max_children` | Optional | The maximum number of matching child documents for a parent document. If exceeded, the parent document is excluded from the search results. | +| `min_children` | Optional | The minimum number of matching child documents required for a parent document to be included in the results. If not met, the parent is excluded. Default is `1`.| +| `score_mode` | Optional | Defines how scores of matching child documents influence the parent document's score. Valid values are:
- `none`: Ignores the relevance scores of child documents and assigns a score of `0` to the parent document.
- `avg`: Uses the average relevance score of all matching child documents.
- `max`: Assigns the highest relevance score from the matching child documents to the parent.
- `min`: Assigns the lowest relevance score from the matching child documents to the parent.
- `sum`: Sums the relevance scores of all matching child documents.
Default is `none`. | +| `inner_hits` | Optional | If provided, returns the underlying hits (child documents) that matched the query. | + + +## Sorting limitations + +The `has_child` query does not support [sorting results]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/sort/) using standard sorting options. If you need to sort parent documents by fields in their child documents, you can use a [`function_score` query]({{site.url}}{{site.baseurl}}/query-dsl/compound/function-score/) and sort by the parent document's score. + +In the preceding example, you can sort parent documents (brands) based on the `sales_count` of their child products. This query multiplies the score by the `sales_count` field of the child documents and assigns the highest relevance score from the matching child documents to the parent: + +```json +GET testindex1/_search +{ + "query": { + "has_child": { + "type": "product", + "query": { + "function_score": { + "script_score": { + "script": "_score * doc['sales_count'].value" + } + } + }, + "score_mode": "max" + } + } +} +``` +{% include copy-curl.html %} + +The response contains the brands sorted by the highest child `sales_count`: + +```json +{ + "took": 6, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 300, + "hits": [ + { + "_index": "testindex1", + "_id": "2", + "_score": 300, + "_source": { + "name": "Economy brand", + "product_to_brand": "brand" + } + }, + { + "_index": "testindex1", + "_id": "1", + "_score": 150, + "_source": { + "name": "Luxury brand", + "product_to_brand": "brand" + } + } + ] + } +} +``` + +## Next steps + +- Learn more about [retrieving inner hits]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/inner-hits/). \ No newline at end of file diff --git a/_query-dsl/joining/has-parent.md b/_query-dsl/joining/has-parent.md new file mode 100644 index 00000000000..2232009fe73 --- /dev/null +++ b/_query-dsl/joining/has-parent.md @@ -0,0 +1,358 @@ +--- +layout: default +title: Has parent +parent: Joining queries +nav_order: 20 +--- + +# Has parent query + +The `has_parent` query returns child documents whose parent documents match a specific query. You can establish parent-child relationships between documents in the same index by using a [join]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/join/) field type. + +The `has_parent` query is slower than other queries because of the join operation it performs. Performance decreases as the number of matching parent documents increases. Each `has_parent` query in your search may significantly impact query performance. If you prioritize speed, avoid using this query or limit its usage as much as possible. +{: .warning} + +## Example + +Before you can run a `has_parent` query, your index must contain a [join]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/join/) field in order to establish parent-child relationships. The index mapping request uses the following format: + +```json +PUT /example_index +{ + "mappings": { + "properties": { + "relationship_field": { + "type": "join", + "relations": { + "parent_doc": "child_doc" + } + } + } + } +} +``` +{% include copy-curl.html %} + +For this example, first configure an index that contains documents representing products and their brands as described in the [`has_child` query example]({{site.url}}{{site.baseurl}}/query-dsl/joining/has-child/). + +To search for the child of a parent, use a `has_parent` query. The following query returns child documents (products) made by the brand matching the query `economy`: + +```json +GET testindex1/_search +{ + "query" : { + "has_parent": { + "parent_type":"brand", + "query": { + "match" : { + "name": "economy" + } + } + } + } +} +``` +{% include copy-curl.html %} + +The response returns all products made by the brand: + +```json +{ + "took": 11, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "testindex1", + "_id": "4", + "_score": 1, + "_routing": "2", + "_source": { + "name": "Electronic watch", + "sales_count": 300, + "product_to_brand": { + "name": "product", + "parent": "2" + } + } + }, + { + "_index": "testindex1", + "_id": "5", + "_score": 1, + "_routing": "2", + "_source": { + "name": "Digital watch", + "sales_count": 100, + "product_to_brand": { + "name": "product", + "parent": "2" + } + } + } + ] + } +} +``` + +## Retrieving inner hits + +To return parent documents that matched the query, provide the `inner_hits` parameter: + +```json +GET testindex1/_search +{ + "query" : { + "has_parent": { + "parent_type":"brand", + "query": { + "match" : { + "name": "economy" + } + }, + "inner_hits": {} + } + } +} +``` +{% include copy-curl.html %} + +The response contains parent documents in the `inner_hits` field: + +```json +{ + "took": 11, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "testindex1", + "_id": "4", + "_score": 1, + "_routing": "2", + "_source": { + "name": "Electronic watch", + "sales_count": 300, + "product_to_brand": { + "name": "product", + "parent": "2" + } + }, + "inner_hits": { + "brand": { + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 1.3862942, + "hits": [ + { + "_index": "testindex1", + "_id": "2", + "_score": 1.3862942, + "_source": { + "name": "Economy brand", + "product_to_brand": "brand" + } + } + ] + } + } + } + }, + { + "_index": "testindex1", + "_id": "5", + "_score": 1, + "_routing": "2", + "_source": { + "name": "Digital watch", + "sales_count": 100, + "product_to_brand": { + "name": "product", + "parent": "2" + } + }, + "inner_hits": { + "brand": { + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 1.3862942, + "hits": [ + { + "_index": "testindex1", + "_id": "2", + "_score": 1.3862942, + "_source": { + "name": "Economy brand", + "product_to_brand": "brand" + } + } + ] + } + } + } + } + ] + } +} +``` + +For more information about retrieving inner hits, see [Inner hits]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/inner-hits/). + +## Parameters + +The following table lists all top-level parameters supported by `has_parent` queries. + +| Parameter | Required/Optional | Description | +|:---|:---|:---| +| `parent_type` | Required | Specifies the name of the parent relationship as defined in the `join` field mapping. | +| `query` | Required | The query to run on parent documents. If a parent document matches the query, the child document is returned. | +| `ignore_unmapped` | Optional | Indicates whether to ignore unmapped `parent_type` fields and not return documents instead of throwing an error. You can provide this parameter when querying multiple indexes, some of which may not contain the `parent_type` field. Default is `false`. | +| `score` | Optional | Indicates whether the relevance score of a matching parent document is aggregated into its child documents. If `false`, then the relevance score of the parent document is ignored, and each child document is assigned a relevance score equal to the query's `boost`, which defaults to `1`. If `true`, then the relevance score of the matching parent document is aggregated into the relevance scores of its child documents. Default is `false`. | +| `inner_hits` | Optional | If provided, returns the underlying hits (parent documents) that matched the query. | + + +## Sorting limitations + +The `has_parent` query does not support [sorting results]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/sort/) using standard sorting options. If you need to sort child documents by fields in their parent documents, you can use a [`function_score` query]({{site.url}}{{site.baseurl}}/query-dsl/compound/function-score/) and sort by the child document's score. + +For the preceding example, first add a `customer_satisfaction` field by which you'll sort the child documents belonging to the parent (brand) documents: + +```json +PUT testindex1/_doc/1 +{ + "name": "Luxury watch brand", + "product_to_brand" : "brand", + "customer_satisfaction": 4.5 +} +``` +{% include copy-curl.html %} + +```json +PUT testindex1/_doc/2 +{ + "name": "Economy watch brand", + "product_to_brand" : "brand", + "customer_satisfaction": 3.9 +} +``` +{% include copy-curl.html %} + +Now you can sort child documents (products) based on the `customer_satisfaction` field of their parent brands. This query multiplies the score by the `customer_satisfaction` field of the parent documents: + +```json +GET testindex1/_search +{ + "query": { + "has_parent": { + "parent_type": "brand", + "score": true, + "query": { + "function_score": { + "script_score": { + "script": "_score * doc['customer_satisfaction'].value" + } + } + } + } + } +} +``` +{% include copy-curl.html %} + +The response contains the products, sorted by the highest parent `customer_satisfaction`: + +```json +{ + "took": 11, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 3, + "relation": "eq" + }, + "max_score": 4.5, + "hits": [ + { + "_index": "testindex1", + "_id": "3", + "_score": 4.5, + "_routing": "1", + "_source": { + "name": "Mechanical watch", + "sales_count": 150, + "product_to_brand": { + "name": "product", + "parent": "1" + } + } + }, + { + "_index": "testindex1", + "_id": "4", + "_score": 3.9, + "_routing": "2", + "_source": { + "name": "Electronic watch", + "sales_count": 300, + "product_to_brand": { + "name": "product", + "parent": "2" + } + } + }, + { + "_index": "testindex1", + "_id": "5", + "_score": 3.9, + "_routing": "2", + "_source": { + "name": "Digital watch", + "sales_count": 100, + "product_to_brand": { + "name": "product", + "parent": "2" + } + } + } + ] + } +} +``` + +## Next steps + +- Learn more about [retrieving inner hits]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/inner-hits/). \ No newline at end of file diff --git a/_query-dsl/joining/index.md b/_query-dsl/joining/index.md new file mode 100644 index 00000000000..74ad7f1ea1a --- /dev/null +++ b/_query-dsl/joining/index.md @@ -0,0 +1,21 @@ +--- +layout: default +title: Joining queries +has_children: true +nav_order: 55 +has_toc: false +redirect_from: + - /query-dsl/joining/ +--- + +# Joining queries + +OpenSearch is a distributed system in which data is spread across multiple nodes. Thus, running a SQL-like JOIN operation in OpenSearch is resource intensive. As an alternative, OpenSearch provides the following queries that perform join operations and are optimized for scaling across multiple nodes: + +- `nested` queries: Act as wrappers for other queries to search [nested]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/nested/) fields. The nested field objects are searched as though they were indexed as separate documents. +- [`has_child`]({{site.url}}{{site.baseurl}}/query-dsl/joining/has-child/) queries: Search for parent documents whose child documents match the query. +- [`has_parent`]({{site.url}}{{site.baseurl}}/query-dsl/joining/has-parent/) queries: Search for child documents whose parent documents match the query. +- `parent_id` queries: A [join]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/join/) field type establishes a parent/child relationship between documents in the same index. `parent_id` queries search for child documents that are joined to a specific parent document. + +If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, then joining queries are not executed. +{: .important} \ No newline at end of file diff --git a/_query-dsl/specialized/neural.md b/_query-dsl/specialized/neural.md index 14b930cdb69..6cd534b87f9 100644 --- a/_query-dsl/specialized/neural.md +++ b/_query-dsl/specialized/neural.md @@ -35,6 +35,8 @@ Field | Data type | Required/Optional | Description `min_score` | Float | Optional | The minimum score threshold for the search results. Only one variable, either `k`, `min_score`, or `max_distance`, can be specified. For more information, see [k-NN radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/). `max_distance` | Float | Optional | The maximum distance threshold for the search results. Only one variable, either `k`, `min_score`, or `max_distance`, can be specified. For more information, see [k-NN radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/). `filter` | Object | Optional | A query that can be used to reduce the number of documents considered. For more information about filter usage, see [k-NN search with filters]({{site.url}}{{site.baseurl}}/search-plugins/knn/filter-search-knn/). **Important**: Filter can only be used with the `faiss` or `lucene` engines. +`method_parameters` | Object | Optional | Parameters passed to the k-NN index during search. See [Additional query parameters]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#additional-query-parameters). +`rescore` | Object | Optional | Parameters for configuring rescoring functionality for k-NN indexes built using quantization. See [Rescoring]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#rescoring-quantized-results-using-full-precision). #### Example request diff --git a/_query-dsl/term/fuzzy.md b/_query-dsl/term/fuzzy.md index 7a426fd7948..0e448bbbdad 100644 --- a/_query-dsl/term/fuzzy.md +++ b/_query-dsl/term/fuzzy.md @@ -89,5 +89,5 @@ Parameter | Data type | Description Specifying a large value in `max_expansions` can lead to poor performance, especially if `prefix_length` is set to `0`, because of the large number of variations of the word that OpenSearch tries to match. {: .warning} -If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, fuzzy queries are not run. +If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, then fuzzy queries are not executed. {: .important} diff --git a/_query-dsl/term/prefix.md b/_query-dsl/term/prefix.md index 2a429c9f0e3..087c26cc302 100644 --- a/_query-dsl/term/prefix.md +++ b/_query-dsl/term/prefix.md @@ -66,5 +66,5 @@ Parameter | Data type | Description `case_insensitive` | Boolean | If `true`, allows case-insensitive matching of the value with the indexed field values. Default is `false` (case sensitivity is determined by the field's mapping). `rewrite` | String | Determines how OpenSearch rewrites and scores multi-term queries. Valid values are `constant_score`, `scoring_boolean`, `constant_score_boolean`, `top_terms_N`, `top_terms_boost_N`, and `top_terms_blended_freqs_N`. Default is `constant_score`. -If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, prefix queries are not run. If `index_prefixes` is enabled, the `search.allow_expensive_queries` setting is ignored and an optimized query is built and run. +If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, then prefix queries are not executed. If `index_prefixes` is enabled, then the `search.allow_expensive_queries` setting is ignored and an optimized query is built and run. {: .important} diff --git a/_query-dsl/term/range.md b/_query-dsl/term/range.md index ceb264db766..1e6fece2181 100644 --- a/_query-dsl/term/range.md +++ b/_query-dsl/term/range.md @@ -217,5 +217,5 @@ Parameter | Data type | Description `boost` | Floating-point | A floating-point value that specifies the weight of this field toward the relevance score. Values above 1.0 increase the field’s relevance. Values between 0.0 and 1.0 decrease the field’s relevance. Default is 1.0. `time_zone` | String | The time zone used to convert [`date`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/date/) values to UTC in the query. Valid values are ISO 8601 [UTC offsets](https://en.wikipedia.org/wiki/List_of_UTC_offsets) and [IANA time zone IDs](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones). For more information, see [Time zone](#time-zone). -If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, range queries on [`text`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/text/) and [`keyword`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/keyword/) fields are not run. +If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, then range queries on [`text`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/text/) and [`keyword`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/keyword/) fields are not executed. {: .important} diff --git a/_query-dsl/term/regexp.md b/_query-dsl/term/regexp.md index 4a038729c0c..34a0c916ce2 100644 --- a/_query-dsl/term/regexp.md +++ b/_query-dsl/term/regexp.md @@ -61,5 +61,5 @@ Parameter | Data type | Description `max_determinized_states` | Integer | Lucene converts a regular expression to an automaton with a number of determinized states. This parameter specifies the maximum number of automaton states the query requires. Use this parameter to prevent high resource consumption. To run complex regular expressions, you may need to increase the value of this parameter. Default is 10,000. `rewrite` | String | Determines how OpenSearch rewrites and scores multi-term queries. Valid values are `constant_score`, `scoring_boolean`, `constant_score_boolean`, `top_terms_N`, `top_terms_boost_N`, and `top_terms_blended_freqs_N`. Default is `constant_score`. -If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, `regexp` queries are not run. +If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, then `regexp` queries are not executed. {: .important} diff --git a/_query-dsl/term/terms.md b/_query-dsl/term/terms.md index 42c74c0436c..7dac6a96190 100644 --- a/_query-dsl/term/terms.md +++ b/_query-dsl/term/terms.md @@ -39,6 +39,7 @@ Parameter | Data type | Description :--- | :--- | :--- `` | String | The field in which to search. A document is returned in the results only if its field value exactly matches at least one term, with the correct spacing and capitalization. `boost` | Floating-point | A floating-point value that specifies the weight of this field toward the relevance score. Values above 1.0 increase the field’s relevance. Values between 0.0 and 1.0 decrease the field’s relevance. Default is 1.0. +`value_type` | String | Specifies the types of values used for filtering. Valid values are `default` and `bitmap`. If omitted, the value defaults to `default`. ## Terms lookup @@ -250,3 +251,136 @@ Parameter | Data type | Description `path` | String | The name of the field from which to fetch field values. Specify nested fields using dot path notation. Required. `routing` | String | Custom routing value of the document from which to fetch field values. Optional. Required if a custom routing value was provided when the document was indexed. `boost` | Floating-point | A floating-point value that specifies the weight of this field toward the relevance score. Values above 1.0 increase the field’s relevance. Values between 0.0 and 1.0 decrease the field’s relevance. Default is 1.0. + +## Bitmap filtering +**Introduced 2.17** +{: .label .label-purple } + +The `terms` query can filter for multiple terms simultaneously. However, when the number of terms in the input filter increases to a large value (around 10,000), the resulting network and memory overhead can become significant, making the query inefficient. In such cases, consider encoding your large terms filter using a [roaring bitmap](https://github.com/RoaringBitmap/RoaringBitmap) for more efficient filtering. + +The following example assumes that you have two indexes: a `products` index, which contains all the products sold by a company, and a `customers` index, which stores filters representing customers who own specific products. + +First, create a `products` index and map `product_id` as a `keyword`: + +```json +PUT /products +{ + "mappings": { + "properties": { + "product_id": { "type": "keyword" } + } + } +} +``` +{% include copy-curl.html %} + +Next, index three documents that correspond to products: + +```json +PUT students/_doc/1 +{ + "name": "Product 1", + "product_id" : "111" +} +``` +{% include copy-curl.html %} + +```json +PUT students/_doc/2 +{ + "name": "Product 2", + "product_id" : "222" +} +``` +{% include copy-curl.html %} + +```json +PUT students/_doc/3 +{ + "name": "Product 3", + "product_id" : "333" +} +``` +{% include copy-curl.html %} + +To store customer bitmap filters, you'll create a `customer_filter` [binary field](https://opensearch.org/docs/latest/field-types/supported-field-types/binary/) in the `customers` index. Specify `store` as `true` to store the field: + +```json +PUT /customers +{ + "mappings": { + "properties": { + "customer_filter": { + "type": "binary", + "store": true + } + } + } +} +``` +{% include copy-curl.html %} + +For each customer, you need to generate a bitmap that represents the product IDs of the products the customer owns. This bitmap effectively encodes the filter criteria for that customer. In this example, you'll create a `terms` filter for a customer whose ID is `customer123` and who owns products `111`, `222`, and `333`. + +To encode a `terms` filter for the customer, first create a roaring bitmap for the filter. This example creates a bitmap using the [PyRoaringBitMap] library, so first run `pip install pyroaring` to install the library. Then serialize the bitmap and encode it using a [Base64](https://en.wikipedia.org/wiki/Base64) encoding scheme: + +```py +from pyroaring import BitMap +import base64 + +# Create a bitmap, serialize it into a byte string, and encode into Base64 +bm = BitMap([111, 222, 333]) # product ids owned by a customer +encoded = base64.b64encode(BitMap.serialize(bm)) + +# Convert the Base64-encoded bytes to a string for storage or transmission +encoded_bm_str = encoded.decode('utf-8') + +# Print the encoded bitmap +print(f"Encoded Bitmap: {encoded_bm_str}") +``` +{% include copy.html %} + +Next, index the customer filter into the `customers` index. The document ID for the filter is the same as the ID for the corresponding customer (in this example, `customer123`). The `customer_filter` field contains the bitmap you generated for this customer: + +```json +POST customers/_doc/customer123 +{ + "customer_filter": "OjAAAAEAAAAAAAIAEAAAAG8A3gBNAQ==" +} +``` +{% include copy-curl.html %} + +Now you can run a `terms` query on the `products` index to look up a specific customer in the `customers` index. Because you're looking up a stored field instead of `_source`, set `store` to `true`. In the `value_type` field, specify the data type of the `terms` input as `bitmap`: + +```json +POST /products/_search +{ + "query": { + "terms": { + "product_id": { + "index": "customers", + "id": "customer123", + "path": "customer_filter", + "store": true + }, + "value_type": "bitmap" + } + } +} +``` +{% include copy-curl.html %} + +You can also directly pass the bitmap to the `terms` query. In this example, the `product_id` field contains the customer filter bitmap for the customer whose ID is `customer123`: + +```json +POST /products/_search +{ + "query": { + "terms": { + "product_id": "OjAAAAEAAAAAAAIAEAAAAG8A3gBNAQ==", + "value_type": "bitmap" + } + } +} +``` +{% include copy-curl.html %} \ No newline at end of file diff --git a/_query-dsl/term/wildcard.md b/_query-dsl/term/wildcard.md index b2d72387582..c6e0499517f 100644 --- a/_query-dsl/term/wildcard.md +++ b/_query-dsl/term/wildcard.md @@ -64,5 +64,5 @@ Parameter | Data type | Description `case_insensitive` | Boolean | If `true`, allows case-insensitive matching of the value with the indexed field values. Default is `false` (case sensitivity is determined by the field's mapping). `rewrite` | String | Determines how OpenSearch rewrites and scores multi-term queries. Valid values are `constant_score`, `scoring_boolean`, `constant_score_boolean`, `top_terms_N`, `top_terms_boost_N`, and `top_terms_blended_freqs_N`. Default is `constant_score`. -If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, wildcard queries are not run. +If [`search.allow_expensive_queries`]({{site.url}}{{site.baseurl}}/query-dsl/index/#expensive-queries) is set to `false`, then wildcard queries are not executed. {: .important} diff --git a/_sass/color_schemes/odfe.scss b/_sass/color_schemes/odfe.scss deleted file mode 100644 index f9b2ca02ba3..00000000000 --- a/_sass/color_schemes/odfe.scss +++ /dev/null @@ -1,75 +0,0 @@ -// -// Brand colors -// - -$white: #FFFFFF; - -$grey-dk-300: #241F21; // Error -$grey-dk-250: mix(white, $grey-dk-300, 12.5%); -$grey-dk-200: mix(white, $grey-dk-300, 25%); -$grey-dk-100: mix(white, $grey-dk-300, 50%); -$grey-dk-000: mix(white, $grey-dk-300, 75%); - -$grey-lt-300: #DBDBDB; // Cloud -$grey-lt-200: mix(white, $grey-lt-300, 25%); -$grey-lt-100: mix(white, $grey-lt-300, 50%); -$grey-lt-000: mix(white, $grey-lt-300, 75%); - -$blue-300: #00007C; // Meta -$blue-200: mix(white, $blue-300, 25%); -$blue-100: mix(white, $blue-300, 50%); -$blue-000: mix(white, $blue-300, 75%); - -$purple-300: #9600FF; // Prpl -$purple-200: mix(white, $purple-300, 25%); -$purple-100: mix(white, $purple-300, 50%); -$purple-000: mix(white, $purple-300, 75%); - -$green-300: #00671A; // Element -$green-200: mix(white, $green-300, 25%); -$green-100: mix(white, $green-300, 50%); -$green-000: mix(white, $green-300, 75%); - -$yellow-300: #FFDF00; // Kan-Banana -$yellow-200: mix(white, $yellow-300, 25%); -$yellow-100: mix(white, $yellow-300, 50%); -$yellow-000: mix(white, $yellow-300, 75%); - -$red-300: #BD145A; // Ruby -$red-200: mix(white, $red-300, 25%); -$red-100: mix(white, $red-300, 50%); -$red-000: mix(white, $red-300, 75%); - -$blue-lt-300: #0000FF; // Cascade -$blue-lt-200: mix(white, $blue-lt-300, 25%); -$blue-lt-100: mix(white, $blue-lt-300, 50%); -$blue-lt-000: mix(white, $blue-lt-300, 75%); - -/* -Other, unused brand colors - -Float #2797F4 -Firewall #0FF006B -Hyper Pink #F261A1 -Cluster #ED20EB -Back End #808080 -Python #25EE5C -Warm Node #FEA501 -*/ - -$body-background-color: $white; -$sidebar-color: $grey-lt-000; -$code-background-color: $grey-lt-000; - -$body-text-color: $grey-dk-200; -$body-heading-color: $grey-dk-300; -$nav-child-link-color: $grey-dk-200; -$link-color: mix(black, $blue-lt-300, 37.5%); -$btn-primary-color: $purple-300; -$base-button-color: $grey-lt-000; - -// $border-color: $grey-dk-200; -// $search-result-preview-color: $grey-dk-000; -// $search-background-color: $grey-dk-250; -// $table-background-color: $grey-dk-250; -// $feedback-color: darken($sidebar-color, 3%); diff --git a/_search-plugins/collapse-search.md b/_search-plugins/collapse-search.md new file mode 100644 index 00000000000..ec7e57515ae --- /dev/null +++ b/_search-plugins/collapse-search.md @@ -0,0 +1,231 @@ +--- +layout: default +title: Collapse search results +nav_order: 3 +--- + +# Collapse search results + +The `collapse` parameter groups search results by a particular field value. This returns only the top document within each group, which helps reduce redundancy by eliminating duplicates. + +The `collapse` parameter requires the field being collapsed to be of either a `keyword` or a `numeric` type. + +--- + +## Collapsing search results + +To populate an index with data, define the index mappings and an `item` field indexed as a `keyword`. The following example request shows you how to define index mappings, populate an index, and then search it. + +#### Define index mappings + +```json +PUT /bakery-items +{ + "mappings": { + "properties": { + "item": { + "type": "keyword" + }, + "category": { + "type": "keyword" + }, + "price": { + "type": "float" + }, + "baked_date": { + "type": "date" + } + } + } +} +``` + +#### Populate an index + +```json +POST /bakery-items/_bulk +{ "index": {} } +{ "item": "Chocolate Cake", "category": "cakes", "price": 15, "baked_date": "2023-07-01T00:00:00Z" } +{ "index": {} } +{ "item": "Chocolate Cake", "category": "cakes", "price": 18, "baked_date": "2023-07-04T00:00:00Z" } +{ "index": {} } +{ "item": "Vanilla Cake", "category": "cakes", "price": 12, "baked_date": "2023-07-02T00:00:00Z" } +``` + +#### Search the index, returning all results + +```json +GET /bakery-items/_search +{ + "query": { + "match": { + "category": "cakes" + } + }, + "sort": ["price"] +} +``` + +This query returns the uncollapsed search results, showing all documents, including both entries for "Chocolate Cake". + +#### Search the index and collapse the results + +To group search results by the `item` field and sort them by `price`, you can use the following query: + +**Collapsed `item` field search results** + +```json +GET /bakery-items/_search +{ + "query": { + "match": { + "category": "cakes" + } + }, + "collapse": { + "field": "item" + }, + "sort": ["price"] +} +``` + +**Response** + +```json +{ + "took": 3, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 4, + "relation": "eq" + }, + "max_score": null, + "hits": [ + { + "_index": "bakery-items", + "_id": "mISga5EB2HLDXHkv9kAr", + "_score": null, + "_source": { + "item": "Vanilla Cake", + "category": "cakes", + "price": 12, + "baked_date": "2023-07-02T00:00:00Z", + "baker": "Baker A" + }, + "fields": { + "item": [ + "Vanilla Cake" + ] + }, + "sort": [ + 12 + ] + }, + { + "_index": "bakery-items", + "_id": "loSga5EB2HLDXHkv9kAr", + "_score": null, + "_source": { + "item": "Chocolate Cake", + "category": "cakes", + "price": 15, + "baked_date": "2023-07-01T00:00:00Z", + "baker": "Baker A" + }, + "fields": { + "item": [ + "Chocolate Cake" + ] + }, + "sort": [ + 15 + ] + } + ] + } +} +``` + +The collapsed search results will show only one "Chocolate Cake" entry, demonstrating how the `collapse` parameter reduces redundancy. + +The `collapse` parameter affects only the top search results and does not change any aggregation results. The total number of hits shown in the response reflects all matching documents before the parameter is applied, including duplicates. However, the response doesn't indicate the exact number of unique groups formed by the operation. + +--- + +## Expanding collapsed results + +You can expand each collapsed top hit with the `inner_hits` property. + +The following example request applies `inner_hits` to retrieve the lowest-priced and most recent item, for each type of cake: + +```json +GET /bakery-items/_search +{ + "query": { + "match": { + "category": "cakes" + } + }, + "collapse": { + "field": "item", + "inner_hits": [ + { + "name": "cheapest_items", + "size": 1, + "sort": ["price"] + }, + { + "name": "newest_items", + "size": 1, + "sort": [{ "baked_date": "desc" }] + } + ] + }, + "sort": ["price"] +} + +``` + +### Multiple inner hits for each collapsed hit + +To obtain several groups of inner hits for each collapsed result, you can set different criteria for each group. For example, lets request the three most recent items for every bakery item: + +```json +GET /bakery-items/_search +{ + "query": { + "match": { + "category": "cakes" + } + }, + "collapse": { + "field": "item", + "inner_hits": [ + { + "name": "cheapest_items", + "size": 1, + "sort": ["price"] + }, + { + "name": "newest_items", + "size": 3, + "sort": [{ "baked_date": "desc" }] + } + ] + }, + "sort": ["price"] +} + + +``` +This query searches for documents in the `cakes` category and groups the search results by the `item_name` field. For each `item_name`, it retrieves the top three lowest-priced items and the top three most recent items, sorted by `baked_date` in descending order. + +You can expand the groups by sending an additional query for each inner hit request corresponding to each collapsed hit in the response. This can significantly slow down the process if there are too many groups or inner hit requests. The `max_concurrent_group_searches` request parameter can be used to control the maximum number of concurrent searches allowed in this phase. The default is based on the number of data nodes and the default search thread pool size. + diff --git a/_search-plugins/concurrent-segment-search.md b/_search-plugins/concurrent-segment-search.md index cbbb993ac97..80614e2fffe 100644 --- a/_search-plugins/concurrent-segment-search.md +++ b/_search-plugins/concurrent-segment-search.md @@ -22,6 +22,8 @@ Without concurrent segment search, Lucene executes a request sequentially across ## Enabling concurrent segment search at the index or cluster level +Starting with OpenSearch version 2.17, you can use the `search.concurrent_segment_search.mode` setting to configure concurrent segment search on your cluster. The existing `search.concurrent_segment_search.enabled` setting will be deprecated in future version releases in favor of the new setting. + By default, concurrent segment search is disabled on the cluster. You can enable concurrent segment search at two levels: - Cluster level @@ -30,8 +32,37 @@ By default, concurrent segment search is disabled on the cluster. You can enable The index-level setting takes priority over the cluster-level setting. Thus, if the cluster setting is enabled but the index setting is disabled, then concurrent segment search will be disabled for that index. Because of this, the index-level setting is not evaluated unless it is explicitly set, regardless of the default value configured for the setting. You can retrieve the current value of the index-level setting by calling the [Index Settings API]({{site.url}}{{site.baseurl}}/api-reference/index-apis/get-settings/) and omitting the `?include_defaults` query parameter. {: .note} -To enable concurrent segment search for all indexes in the cluster, set the following dynamic cluster setting: +Both the cluster- and index-level `search.concurrent_segment_search.mode` settings accept the following values: + +- `all`: Enables concurrent segment search across all search requests. This is equivalent to setting `search.concurrent_segment_search.enabled` to `true`. + +- `none`: Disables concurrent segment search for all search requests, effectively turning off the feature. This is equivalent to setting `search.concurrent_segment_search.enabled` to `false`. This is the **default** behavior. + +- `auto`: In this mode, OpenSearch will use the pluggable _concurrent search decider_ to decide whether to use a concurrent or sequential path for the search request based on the query evaluation and the presence of aggregations in the request. By default, if there are no deciders configured by any plugin, then the decision to use concurrent search will be made based on the presence of aggregations in the request. For more information about the pluggable decider semantics, see [Pluggable concurrent search deciders](#pluggable-concurrent-search-deciders-concurrentsearchrequestdecider). + +To enable concurrent segment search for all search requests across every index in the cluster, send the following request: +```json +PUT _cluster/settings +{ + "persistent":{ + "search.concurrent_segment_search.mode": "all" + } +} +``` +{% include copy-curl.html %} + +To enable concurrent segment search for all search requests on a particular index, specify the index name in the endpoint: + +```json +PUT /_settings +{ + "index.search.concurrent_segment_search.mode": "all" +} +``` +{% include copy-curl.html %} + +You can continue to use the existing `search.concurrent_segment_search.enabled` setting to enable concurrent segment search for all indexes in the cluster as follows: ```json PUT _cluster/settings { @@ -52,6 +83,35 @@ PUT /_settings ``` {% include copy-curl.html %} + +When evaluating whether concurrent segment search is enabled on a cluster, the `search.concurrent_segment_search.mode` setting takes precedence over the `search.concurrent_segment_search.enabled` setting. +If the `search.concurrent_segment_search.mode` setting is not explicitly set, then the `search.concurrent_segment_search.enabled` setting will be evaluated to determine whether to enable concurrent segment search. + +When upgrading a cluster from an earlier version that specifies the older `search.concurrent_segment_search.enabled` setting, this setting will continue to be honored. However, once the `search.concurrent_segment_search.mode` is set, it will override the previous setting, enabling or disabling concurrent search based on the specified mode. +We recommend setting `search.concurrent_segment_search.enabled` to `null` on your cluster once you configure `search.concurrent_segment_search.mode`: + +```json +PUT _cluster/settings +{ + "persistent":{ + "search.concurrent_segment_search.enabled": null + } +} +``` +{% include copy-curl.html %} + +To disable the old setting for a particular index, specify the index name in the endpoint: +```json +PUT /_settings +{ + "index.search.concurrent_segment_search.enabled": null +} +``` +{% include copy-curl.html %} + + + + ## Slicing mechanisms You can choose one of two available mechanisms for assigning segments to slices: the default [Lucene mechanism](#the-lucene-mechanism) or the [max slice count mechanism](#the-max-slice-count-mechanism). @@ -66,7 +126,10 @@ The _max slice count_ mechanism is an alternative slicing mechanism that uses a ### Setting the slicing mechanism -By default, concurrent segment search uses the Lucene mechanism to calculate the number of slices for each shard-level request. To use the max slice count mechanism instead, configure the `search.concurrent.max_slice_count` cluster setting: +By default, concurrent segment search uses the Lucene mechanism to calculate the number of slices for each shard-level request. +To use the max slice count mechanism instead, you can set the slice count for concurrent segment search at either the cluster level or index level. + +To configure the slice count for all indexes in a cluster, use the following dynamic cluster setting: ```json PUT _cluster/settings @@ -78,7 +141,17 @@ PUT _cluster/settings ``` {% include copy-curl.html %} -The `search.concurrent.max_slice_count` setting can take the following valid values: +To configure the slice count for a particular index, specify the index name in the endpoint: + +```json +PUT /_settings +{ + "index.search.concurrent.max_slice_count": 2 +} +``` +{% include copy-curl.html %} + +Both the cluster- and index-level `search.concurrent.max_slice_count` settings can take the following valid values: - `0`: Use the default Lucene mechanism. - Positive integer: Use the max target slice count mechanism. Usually, a value between 2 and 8 should be sufficient. @@ -117,8 +190,20 @@ Non-concurrent search calculates the document count error and returns it in the For more information about how `shard_size` can affect both `doc_count_error_upper_bound` and collected buckets, see [this GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/11680#issuecomment-1885882985). -## Developer information: AggregatorFactory changes +## Developer information + +The following sections provide additional information for developers. + +### AggregatorFactory changes + +Because of implementation details, not all aggregator types can support concurrent segment search. To accommodate this, we have introduced a [`supportsConcurrentSegmentSearch()`](https://github.com/opensearch-project/OpenSearch/blob/2.x/server/src/main/java/org/opensearch/search/aggregations/AggregatorFactory.java#L123) method in the `AggregatorFactory` class to indicate whether a given aggregation type supports concurrent segment search. By default, this method returns `false`. Any aggregator that needs to support concurrent segment search must override this method in its own factory implementation. + +To ensure that a custom plugin-based `Aggregator` implementation functions with the concurrent search path, plugin developers can verify their implementation with concurrent search enabled and then update the plugin to override the [`supportsConcurrentSegmentSearch()`](https://github.com/opensearch-project/OpenSearch/blob/2.x/server/src/main/java/org/opensearch/search/aggregations/AggregatorFactory.java#L123) method to return `true`. + +### Pluggable concurrent search deciders: ConcurrentSearchRequestDecider -Because of implementation details, not all aggregator types can support concurrent segment search. To accommodate this, we have introduced a [`supportsConcurrentSegmentSearch()`](https://github.com/opensearch-project/OpenSearch/blob/bb38ed4836496ac70258c2472668325a012ea3ed/server/src/main/java/org/opensearch/search/aggregations/AggregatorFactory.java#L121) method in the `AggregatorFactory` class to indicate whether a given aggregation type supports concurrent segment search. By default, this method returns `false`. Any aggregator that needs to support concurrent segment search must override this method in its own factory implementation. +Introduced 2.17 +{: .label .label-purple } -To ensure that a custom plugin-based `Aggregator` implementation works with the concurrent search path, plugin developers can verify their implementation with concurrent search enabled and then update the plugin to override the [`supportsConcurrentSegmentSearch()`](https://github.com/opensearch-project/OpenSearch/blob/bb38ed4836496ac70258c2472668325a012ea3ed/server/src/main/java/org/opensearch/search/aggregations/AggregatorFactory.java#L121) method to return `true`. +Plugin developers can customize the concurrent search decision-making for `auto` mode by extending [`ConcurrentSearchRequestDecider`](https://github.com/opensearch-project/OpenSearch/blob/2.x/server/src/main/java/org/opensearch/search/deciders/ConcurrentSearchRequestDecider.java) and registering its factory through [`SearchPlugin#getConcurrentSearchRequestFactories()`](https://github.com/opensearch-project/OpenSearch/blob/2.x/server/src/main/java/org/opensearch/plugins/SearchPlugin.java#L148). The deciders are evaluated only if a request does not belong to any category listed in the [Limitations](#limitations) and [Other considerations](#other-considerations) sections. For more information about the decider implementation, see [the corresponding GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/15259). +The search request is parsed using a `QueryBuilderVisitor`, which calls the [`ConcurrentSearchRequestDecider#evaluateForQuery()`](https://github.com/opensearch-project/OpenSearch/blob/2.x/server/src/main/java/org/opensearch/search/deciders/ConcurrentSearchRequestDecider.java#L36) method of all the configured deciders for every node of the `QueryBuilder` tree in the search request. The final concurrent search decision is obtained by combining the decision from each decider returned by the [`ConcurrentSearchRequestDecider#getConcurrentSearchDecision()`](https://github.com/opensearch-project/OpenSearch/blob/2.x/server/src/main/java/org/opensearch/search/deciders/ConcurrentSearchRequestDecider.java#L44) method. \ No newline at end of file diff --git a/_search-plugins/index.md b/_search-plugins/index.md index 79e0e715d04..3604245f11a 100644 --- a/_search-plugins/index.md +++ b/_search-plugins/index.md @@ -16,29 +16,31 @@ OpenSearch provides many features for customizing your search use cases and impr ## Search methods -OpenSearch supports the following search methods: +OpenSearch supports the following search methods. -- **Traditional lexical search** +### Traditional lexical search - - [Keyword (BM25) search]({{site.url}}{{site.baseurl}}/search-plugins/keyword-search/): Searches the document corpus for words that appear in the query. +OpenSearch supports [keyword (BM25) search]({{site.url}}{{site.baseurl}}/search-plugins/keyword-search/), which searches the document corpus for words that appear in the query. -- **Machine learning (ML)-powered search** +### ML-powered search - - **Vector search** +OpenSearch supports the following machine learning (ML)-powered search methods: - - [k-NN search]({{site.url}}{{site.baseurl}}/search-plugins/knn/): Searches for k-nearest neighbors to a search term across an index of vectors. +- **Vector search** - - **Neural search**: [Neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) facilitates generating vector embeddings at ingestion time and searching them at search time. Neural search lets you integrate ML models into your search and serves as a framework for implementing other search methods. The following search methods are built on top of neural search: + - [k-NN search]({{site.url}}{{site.baseurl}}/search-plugins/knn/): Searches for the k-nearest neighbors to a search term across an index of vectors. - - [Semantic search]({{site.url}}{{site.baseurl}}/search-plugins/semantic-search/): Considers the meaning of the words in the search context. Uses dense retrieval based on text embedding models to search text data. +- **Neural search**: [Neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) facilitates generating vector embeddings at ingestion time and searching them at search time. Neural search lets you integrate ML models into your search and serves as a framework for implementing other search methods. The following search methods are built on top of neural search: - - [Multimodal search]({{site.url}}{{site.baseurl}}/search-plugins/multimodal-search/): Uses multimodal embedding models to search text and image data. + - [Semantic search]({{site.url}}{{site.baseurl}}/search-plugins/semantic-search/): Considers the meaning of the words in the search context. Uses dense retrieval based on text embedding models to search text data. - - [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/): Uses sparse retrieval based on sparse embedding models to search text data. + - [Multimodal search]({{site.url}}{{site.baseurl}}/search-plugins/multimodal-search/): Uses multimodal embedding models to search text and image data. - - [Hybrid search]({{site.url}}{{site.baseurl}}/search-plugins/hybrid-search/): Combines traditional search and vector search to improve search relevance. + - [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/): Uses sparse retrieval based on sparse embedding models to search text data. - - [Conversational search]({{site.url}}{{site.baseurl}}/search-plugins/conversational-search/): Implements a retrieval-augmented generative search. + - [Hybrid search]({{site.url}}{{site.baseurl}}/search-plugins/hybrid-search/): Combines traditional search and vector search to improve search relevance. + + - [Conversational search]({{site.url}}{{site.baseurl}}/search-plugins/conversational-search/): Implements a retrieval-augmented generative search. ## Query languages diff --git a/_search-plugins/knn/api.md b/_search-plugins/knn/api.md index c7314f7ae29..1a6c9706400 100644 --- a/_search-plugins/knn/api.md +++ b/_search-plugins/knn/api.md @@ -234,7 +234,7 @@ Response field | Description `timestamp` | The date and time when the model was created. `description` | A user-provided description of the model. `error` | An error message explaining why the model is in a failed state. -`space_type` | The space type for which this model is trained, for example, Euclidean or cosine. +`space_type` | The space type for which this model is trained, for example, Euclidean or cosine. Note - this value can be set in the top-level of the request as well `dimension` | The dimensionality of the vector space for which this model is designed. `engine` | The native library used to create the model, either `faiss` or `nmslib`. @@ -351,6 +351,7 @@ Request parameter | Description `search_size` | The training data is pulled from the training index using scroll queries. This parameter defines the number of results to return per scroll query. Default is `10000`. Optional. `description` | A user-provided description of the model. Optional. `method` | The configuration of the approximate k-NN method used for search operations. For more information about the available methods, see [k-NN index method definitions]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions). The method requires training to be valid. +`space_type` | The space type for which this model is trained, for example, Euclidean or cosine. Note: This value can also be set in the `method` parameter. #### Usage @@ -365,10 +366,10 @@ POST /_plugins/_knn/models/{model_id}/_train?preference={node_id} "max_training_vector_count": 1200, "search_size": 100, "description": "My model", + "space_type": "l2", "method": { "name":"ivf", "engine":"faiss", - "space_type": "l2", "parameters":{ "nlist":128, "encoder":{ @@ -395,10 +396,10 @@ POST /_plugins/_knn/models/_train?preference={node_id} "max_training_vector_count": 1200, "search_size": 100, "description": "My model", + "space_type": "l2", "method": { "name":"ivf", "engine":"faiss", - "space_type": "l2", "parameters":{ "nlist":128, "encoder":{ diff --git a/_search-plugins/knn/approximate-knn.md b/_search-plugins/knn/approximate-knn.md index e9cff8562f7..f8921033e06 100644 --- a/_search-plugins/knn/approximate-knn.md +++ b/_search-plugins/knn/approximate-knn.md @@ -49,9 +49,9 @@ PUT my-knn-index-1 "my_vector1": { "type": "knn_vector", "dimension": 2, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "nmslib", "parameters": { "ef_construction": 128, @@ -62,9 +62,9 @@ PUT my-knn-index-1 "my_vector2": { "type": "knn_vector", "dimension": 4, + "space_type": "innerproduct", "method": { "name": "hnsw", - "space_type": "innerproduct", "engine": "faiss", "parameters": { "ef_construction": 256, @@ -199,10 +199,10 @@ POST /_plugins/_knn/models/my-model/_train "training_field": "train-field", "dimension": 4, "description": "My model description", + "space_type": "l2", "method": { "name": "ivf", "engine": "faiss", - "space_type": "l2", "parameters": { "nlist": 4, "nprobes": 2 @@ -308,6 +308,72 @@ Engine | Notes :--- | :--- `faiss` | If `nprobes` is present in a query, it overrides the value provided when creating the index. +### Rescoring quantized results using full precision + +Quantization can be used to significantly reduce the memory footprint of a k-NN index. For more information about quantization, see [k-NN vector quantization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization). Because some vector representation is lost during quantization, the computed distances will be approximate. This causes the overall recall of the search to decrease. + +To improve recall while maintaining the memory savings of quantization, you can use a two-phase search approach. In the first phase, `oversample_factor * k` results are retrieved from an index using quantized vectors and the scores are approximated. In the second phase, the full-precision vectors of those `oversample_factor * k` results are loaded into memory from disk, and scores are recomputed against the full-precision query vector. The results are then reduced to the top k. + +The default rescoring behavior is determined by the `mode` and `compression_level` of the backing k-NN vector field: + +- For `in_memory` mode, no rescoring is applied by default. +- For `on_disk` mode, default rescoring is based on the configured `compression_level`. Each `compression_level` provides a default `oversample_factor`, specified in the following table. + +| Compression level | Default rescore `oversample_factor` | +|:------------------|:----------------------------------| +| `32x` (default) | 3.0 | +| `16x` | 2.0 | +| `8x` | 2.0 | +| `4x` | No default rescoring | +| `2x` | No default rescoring | + +To explicitly apply rescoring, provide the `rescore` parameter in a query on a quantized index and specify the `oversample_factor`: + +```json +GET my-knn-index-1/_search +{ + "size": 2, + "query": { + "knn": { + "target-field": { + "vector": [2, 3, 5, 6], + "k": 2, + "rescore" : { + "oversample_factor": 1.2 + } + } + } + } +} +``` +{% include copy-curl.html %} + +Alternatively, set the `rescore` parameter to `true` to use a default `oversample_factor` of `1.0`: + +```json +GET my-knn-index-1/_search +{ + "size": 2, + "query": { + "knn": { + "target-field": { + "vector": [2, 3, 5, 6], + "k": 2, + "rescore" : true + } + } + } +} +``` +{% include copy-curl.html %} + +The `oversample_factor` is a floating-point number between 1.0 and 100.0, inclusive. The number of results in the first pass is calculated as `oversample_factor * k` and is guaranteed to be between 100 and 10,000, inclusive. If the calculated number of results is smaller than 100, then the number of results is set to 100. If the calculated number of results is greater than 10,000, then the number of results is set to 10,000. + +Rescoring is only supported for the `faiss` engine. + +Rescoring is not needed if quantization is not used because the scores returned are already fully precise. +{: .note} + ### Using approximate k-NN with filters To learn about using filters with k-NN search, see [k-NN search with filters]({{site.url}}{{site.baseurl}}/search-plugins/knn/filter-search-knn/). @@ -322,7 +388,7 @@ To learn more about the radial search feature, see [k-NN radial search]({{site.u ### Using approximate k-NN with binary vectors -To learn more about using binary vectors with k-NN search, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors). +To learn more about using binary vectors with k-NN search, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vectors). ## Spaces @@ -346,5 +412,5 @@ The cosine similarity formula does not include the `1 -` prefix. However, becaus With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected, and a corresponding exception will be thrown. {: .note } -The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors). +The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vectors). {: .note} diff --git a/_search-plugins/knn/disk-based-vector-search.md b/_search-plugins/knn/disk-based-vector-search.md new file mode 100644 index 00000000000..dfb9262db5e --- /dev/null +++ b/_search-plugins/knn/disk-based-vector-search.md @@ -0,0 +1,193 @@ +--- +layout: default +title: Disk-based vector search +nav_order: 16 +parent: k-NN search +has_children: false +--- + +# Disk-based vector search +**Introduced 2.17** +{: .label .label-purple} + +For low-memory environments, OpenSearch provides _disk-based vector search_, which significantly reduces the operational costs for vector workloads. Disk-based vector search uses [binary quantization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/#binary-quantization), compressing vectors and thereby reducing the memory requirements. This memory optimization provides large memory savings at the cost of slightly increased search latency while still maintaining strong recall. + +To use disk-based vector search, set the [`mode`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#vector-workload-modes) parameter to `on_disk` for your vector field type. This parameter will configure your index to use secondary storage. + +## Creating an index for disk-based vector search + +To create an index for disk-based vector search, send the following request: + +```json +PUT my-vector-index +{ + "mappings": { + "properties": { + "my_vector_field": { + "type": "knn_vector", + "dimension": 8, + "space_type": "innerproduct", + "data_type": "float", + "mode": "on_disk" + } + } + } +} +``` +{% include copy-curl.html %} + +By default, the `on_disk` mode configures the index to use the `faiss` engine and `hnsw` method. The default [`compression_level`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#compression-levels) of `32x` reduces the amount of memory the vectors require by a factor of 32. To preserve the search recall, rescoring is enabled by default. A search on a disk-optimized index runs in two phases: The compressed index is searched first, and then the results are rescored using full-precision vectors loaded from disk. + +To reduce the compression level, provide the `compression_level` parameter when creating the index mapping: + +```json +PUT my-vector-index +{ + "mappings": { + "properties": { + "my_vector_field": { + "type": "knn_vector", + "dimension": 8, + "space_type": "innerproduct", + "data_type": "float", + "mode": "on_disk", + "compression_level": "16x" + } + } + } +} +``` +{% include copy-curl.html %} + +For more information about the `compression_level` parameter, see [Compression levels]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#compression-levels). Note that for `4x` compression, the `lucene` engine will be used. +{: .note} + +If you need more granular fine-tuning, you can override additional k-NN parameters in the method definition. For example, to improve recall, increase the `ef_construction` parameter value: + +```json +PUT my-vector-index +{ + "mappings": { + "properties": { + "my_vector_field": { + "type": "knn_vector", + "dimension": 8, + "space_type": "innerproduct", + "data_type": "float", + "mode": "on_disk", + "method": { + "params": { + "ef_construction": 512 + } + } + } + } + } +} +``` +{% include copy-curl.html %} + +The `on_disk` mode only works with the `float` data type. +{: .note} + +## Ingestion + +You can perform document ingestion for a disk-optimized vector index in the same way as for a regular vector index. To index several documents in bulk, send the following request: + +```json +POST _bulk +{ "index": { "_index": "my-vector-index", "_id": "1" } } +{ "my_vector_field": [1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5], "price": 12.2 } +{ "index": { "_index": "my-vector-index", "_id": "2" } } +{ "my_vector_field": [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5], "price": 7.1 } +{ "index": { "_index": "my-vector-index", "_id": "3" } } +{ "my_vector_field": [3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5], "price": 12.9 } +{ "index": { "_index": "my-vector-index", "_id": "4" } } +{ "my_vector_field": [4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5], "price": 1.2 } +{ "index": { "_index": "my-vector-index", "_id": "5" } } +{ "my_vector_field": [5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5], "price": 3.7 } +{ "index": { "_index": "my-vector-index", "_id": "6" } } +{ "my_vector_field": [6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5, 6.5], "price": 10.3 } +{ "index": { "_index": "my-vector-index", "_id": "7" } } +{ "my_vector_field": [7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5], "price": 5.5 } +{ "index": { "_index": "my-vector-index", "_id": "8" } } +{ "my_vector_field": [8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5], "price": 4.4 } +{ "index": { "_index": "my-vector-index", "_id": "9" } } +{ "my_vector_field": [9.5, 9.5, 9.5, 9.5, 9.5, 9.5, 9.5, 9.5], "price": 8.9 } +``` +{% include copy-curl.html %} + +## Search + +Search is also performed in the same way as in other index configurations. The key difference is that, by default, the `oversample_factor` of the rescore parameter is set to `3.0` (unless you override the `compression_level`). For more information, see [Rescoring quantized results using full precision]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#rescoring-quantized-results-using-full-precision). To perform vector search on a disk-optimized index, provide the search vector: + +```json +GET my-vector-index/_search +{ + "query": { + "knn": { + "my_vector_field": { + "vector": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5], + "k": 5 + } + } + } +} +``` +{% include copy-curl.html %} + +Similarly to other index configurations, you can override k-NN parameters in the search request: + +```json +GET my-vector-index/_search +{ + "query": { + "knn": { + "my_vector_field": { + "vector": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5], + "k": 5, + "method_parameters": { + "ef_search": 512 + }, + "rescore": { + "oversample_factor": 10.0 + } + } + } + } +} +``` +{% include copy-curl.html %} + +[Radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/) does not support disk-based vector search. +{: .note} + +## Model-based indexes + +For [model-based indexes]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model), you can specify the `on_disk` parameter in the training request in the same way that you would specify it during index creation. By default, `on_disk` mode will use the [Faiss IVF method]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#supported-faiss-methods) and a compression level of `32x`. To run the training API, send the following request: + +```json +POST /_plugins/_knn/models/test-model/_train +{ + "training_index": "train-index-name", + "training_field": "train-field-name", + "dimension": 8, + "max_training_vector_count": 1200, + "search_size": 100, + "description": "My model", + "space_type": "innerproduct", + "mode": "on_disk" +} +``` +{% include copy-curl.html %} + +This command assumes that training data has been ingested into the `train-index-name` index. For more information, see [Building a k-NN index from a model]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model). +{: .note} + +You can override the `compression_level` for disk-optimized indexes in the same way as for regular k-NN indexes. + + +## Next steps + +- For more information about binary quantization, see [Binary quantization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/#binary-quantization). +- For more information about k-NN vector workload modes, see [Vector workload modes]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#vector-workload-modes). \ No newline at end of file diff --git a/_search-plugins/knn/knn-index.md b/_search-plugins/knn/knn-index.md index 3fb43cfe721..0cfb27397ef 100644 --- a/_search-plugins/knn/knn-index.md +++ b/_search-plugins/knn/knn-index.md @@ -25,9 +25,9 @@ PUT /test-index "my_vector1": { "type": "knn_vector", "dimension": 3, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "lucene", "parameters": { "ef_construction": 128, @@ -41,13 +41,13 @@ PUT /test-index ``` {% include copy-curl.html %} -## Lucene byte vector +## Byte vectors -Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector). +Starting with k-NN plugin version 2.17, you can use `byte` vectors with the `faiss` and `lucene` engines to reduce the amount of required memory and storage space. For more information, see [Byte vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#byte-vectors). -## Binary vector +## Binary vectors -Starting with k-NN plugin version 2.16, you can use `binary` vectors with the `faiss` engine to reduce the amount of required storage space. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors). +Starting with k-NN plugin version 2.16, you can use `binary` vectors with the `faiss` engine to reduce the amount of required storage space. For more information, see [Binary vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vectors). ## SIMD optimization for the Faiss engine @@ -91,7 +91,7 @@ A method definition will always contain the name of the method, the space_type t Mapping parameter | Required | Default | Updatable | Description :--- | :--- | :--- | :--- | :--- `name` | true | n/a | false | The identifier for the nearest neighbor method. -`space_type` | false | l2 | false | The vector space used to calculate the distance between vectors. +`space_type` | false | l2 | false | The vector space used to calculate the distance between vectors. Note: This value can also be specified at the top level of the mapping. `engine` | false | nmslib | false | The approximate k-NN library to use for indexing and search. The available libraries are faiss, nmslib, and Lucene. `parameters` | false | null | false | The parameters used for the nearest neighbor method. @@ -124,7 +124,7 @@ Method name | Requires training | Supported spaces | Description For hnsw, "innerproduct" is not available when PQ is used. {: .note} -The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors). +The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vectors). {: .note} #### HNSW parameters @@ -176,7 +176,6 @@ An index created in OpenSearch version 2.11 or earlier will still use the old `e "method": { "name":"hnsw", "engine":"lucene", - "space_type": "l2", "parameters":{ "m":2048, "ef_construction": 245 @@ -194,7 +193,6 @@ The following example method definition specifies the `hnsw` method and a `pq` e "method": { "name":"hnsw", "engine":"faiss", - "space_type": "l2", "parameters":{ "encoder":{ "name":"pq", @@ -240,7 +238,6 @@ The following example uses the `ivf` method without specifying an encoder (by d "method": { "name":"ivf", "engine":"faiss", - "space_type": "l2", "parameters":{ "nlist": 4, "nprobes": 2 @@ -254,7 +251,6 @@ The following example uses the `ivf` method with a `pq` encoder: "method": { "name":"ivf", "engine":"faiss", - "space_type": "l2", "parameters":{ "encoder":{ "name":"pq", @@ -273,7 +269,6 @@ The following example uses the `hnsw` method without specifying an encoder (by d "method": { "name":"hnsw", "engine":"faiss", - "space_type": "l2", "parameters":{ "ef_construction": 256, "m": 8 @@ -287,7 +282,6 @@ The following example uses the `hnsw` method with an `sq` encoder of type `fp16` "method": { "name":"hnsw", "engine":"faiss", - "space_type": "l2", "parameters":{ "encoder": { "name": "sq", @@ -308,7 +302,6 @@ The following example uses the `ivf` method with an `sq` encoder of type `fp16`: "method": { "name":"ivf", "engine":"faiss", - "space_type": "l2", "parameters":{ "encoder": { "name": "sq", @@ -332,7 +325,7 @@ If you want to use less memory and increase indexing speed as compared to HNSW w If memory is a concern, consider adding a PQ encoder to your HNSW or IVF index. Because PQ is a lossy encoding, query quality will drop. -You can reduce the memory footprint by a factor of 2, with a minimal loss in search quality, by using the [`fp_16` encoder]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/#faiss-16-bit-scalar-quantization). If your vector dimensions are within the [-128, 127] byte range, we recommend using the [byte quantizer]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#lucene-byte-vector) to reduce the memory footprint by a factor of 4. To learn more about vector quantization options, see [k-NN vector quantization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/). +You can reduce the memory footprint by a factor of 2, with a minimal loss in search quality, by using the [`fp_16` encoder]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/#faiss-16-bit-scalar-quantization). If your vector dimensions are within the [-128, 127] byte range, we recommend using the [byte quantizer]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/#byte-vectors) to reduce the memory footprint by a factor of 4. To learn more about vector quantization options, see [k-NN vector quantization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization/). ### Memory estimation diff --git a/_search-plugins/knn/knn-score-script.md b/_search-plugins/knn/knn-score-script.md index d2fd883e74a..a184de2d3d4 100644 --- a/_search-plugins/knn/knn-score-script.md +++ b/_search-plugins/knn/knn-score-script.md @@ -302,5 +302,5 @@ Cosine similarity returns a number between -1 and 1, and because OpenSearch rele With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ... ]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected, and a corresponding exception will be thrown. {: .note } -The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors). +The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vectors). {: .note} diff --git a/_search-plugins/knn/knn-vector-quantization.md b/_search-plugins/knn/knn-vector-quantization.md index 656ce72fd22..a911dc91c98 100644 --- a/_search-plugins/knn/knn-vector-quantization.md +++ b/_search-plugins/knn/knn-vector-quantization.md @@ -11,15 +11,15 @@ has_math: true By default, the k-NN plugin supports the indexing and querying of vectors of type `float`, where each dimension of the vector occupies 4 bytes of memory. For use cases that require ingestion on a large scale, keeping `float` vectors can be expensive because OpenSearch needs to construct, load, save, and search graphs (for native `nmslib` and `faiss` engines). To reduce the memory footprint, you can use vector quantization. -OpenSearch supports many varieties of quantization. In general, the level of quantization will provide a trade-off between the accuracy of the nearest neighbor search and the size of the memory footprint consumed by the vector search. The supported types include byte vectors, 16-bit scalar quantization, and product quantization (PQ). +OpenSearch supports many varieties of quantization. In general, the level of quantization will provide a trade-off between the accuracy of the nearest neighbor search and the size of the memory footprint consumed by the vector search. The supported types include byte vectors, 16-bit scalar quantization, product quantization (PQ), and binary quantization(BQ). -## Lucene byte vector +## Byte vectors -Starting with k-NN plugin version 2.9, you can use `byte` vectors with the Lucene engine in order to reduce the amount of required memory. This requires quantizing the vectors outside of OpenSearch before ingesting them into an OpenSearch index. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector). +Starting with version 2.17, the k-NN plugin supports `byte` vectors with the `faiss` and `lucene` engines in order to reduce the amount of required memory. This requires quantizing the vectors outside of OpenSearch before ingesting them into an OpenSearch index. For more information, see [Byte vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#byte-vectors). ## Lucene scalar quantization -Starting with version 2.16, the k-NN plugin supports built-in scalar quantization for the Lucene engine. Unlike the [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector), which requires you to quantize vectors before ingesting the documents, the Lucene scalar quantizer quantizes input vectors in OpenSearch during ingestion. The Lucene scalar quantizer converts 32-bit floating-point input vectors into 7-bit integer vectors in each segment using the minimum and maximum quantiles computed based on the [`confidence_interval`](#confidence-interval) parameter. During search, the query vector is quantized in each segment using the segment's minimum and maximum quantiles in order to compute the distance between the query vector and the segment's quantized input vectors. +Starting with version 2.16, the k-NN plugin supports built-in scalar quantization for the Lucene engine. Unlike [byte vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#byte-vectors), which require you to quantize vectors before ingesting documents, the Lucene scalar quantizer quantizes input vectors in OpenSearch during ingestion. The Lucene scalar quantizer converts 32-bit floating-point input vectors into 7-bit integer vectors in each segment using the minimum and maximum quantiles computed based on the [`confidence_interval`](#confidence-interval) parameter. During search, the query vector is quantized in each segment using the segment's minimum and maximum quantiles in order to compute the distance between the query vector and the segment's quantized input vectors. Quantization can decrease the memory footprint by a factor of 4 in exchange for some loss in recall. Additionally, quantization slightly increases disk usage because it requires storing both the raw input vectors and the quantized vectors. @@ -40,10 +40,10 @@ PUT /test-index "my_vector1": { "type": "knn_vector", "dimension": 2, + "space_type": "l2", "method": { "name": "hnsw", "engine": "lucene", - "space_type": "l2", "parameters": { "encoder": { "name": "sq" @@ -85,10 +85,10 @@ PUT /test-index "my_vector1": { "type": "knn_vector", "dimension": 2, + "space_type": "l2", "method": { "name": "hnsw", "engine": "lucene", - "space_type": "l2", "parameters": { "encoder": { "name": "sq", @@ -115,7 +115,7 @@ In the ideal scenario, 7-bit vectors created by the Lucene scalar quantizer use #### HNSW memory estimation -The memory required for the Hierarchical Navigable Small World (HNSW) graph can be estimated as `1.1 * (dimension + 8 * M)` bytes/vector, where `M` is the maximum number of bidirectional links created for each element during the construction of the graph. +The memory required for the Hierarchical Navigable Small World (HNSW) graph can be estimated as `1.1 * (dimension + 8 * m)` bytes/vector, where `m` is the maximum number of bidirectional links created for each element during the construction of the graph. As an example, assume that you have 1 million vectors with a dimension of 256 and M of 16. The memory requirement can be estimated as follows: @@ -150,10 +150,10 @@ PUT /test-index "my_vector1": { "type": "knn_vector", "dimension": 3, + "space_type": "l2", "method": { "name": "hnsw", "engine": "faiss", - "space_type": "l2", "parameters": { "encoder": { "name": "sq" @@ -194,10 +194,10 @@ PUT /test-index "my_vector1": { "type": "knn_vector", "dimension": 3, + "space_type": "l2", "method": { "name": "hnsw", "engine": "faiss", - "space_type": "l2", "parameters": { "encoder": { "name": "sq", @@ -250,9 +250,9 @@ In the best-case scenario, 16-bit vectors produced by the Faiss SQfp16 quantizer #### HNSW memory estimation -The memory required for Hierarchical Navigable Small Worlds (HNSW) is estimated to be `1.1 * (2 * dimension + 8 * M)` bytes/vector. +The memory required for Hierarchical Navigable Small Worlds (HNSW) is estimated to be `1.1 * (2 * dimension + 8 * m)` bytes/vector, where `m` is the maximum number of bidirectional links created for each element during the construction of the graph. -As an example, assume that you have 1 million vectors with a dimension of 256 and M of 16. The memory requirement can be estimated as follows: +As an example, assume that you have 1 million vectors with a dimension of 256 and an `m` of 16. The memory requirement can be estimated as follows: ```r 1.1 * (2 * 256 + 8 * 16) * 1,000,000 ~= 0.656 GB @@ -260,9 +260,9 @@ As an example, assume that you have 1 million vectors with a dimension of 256 an #### IVF memory estimation -The memory required for IVF is estimated to be `1.1 * (((2 * dimension) * num_vectors) + (4 * nlist * d))` bytes/vector. +The memory required for IVF is estimated to be `1.1 * (((2 * dimension) * num_vectors) + (4 * nlist * dimension))` bytes/vector, where `nlist` is the number of buckets to partition vectors into. -As an example, assume that you have 1 million vectors with a dimension of 256 and `nlist` of 128. The memory requirement can be estimated as follows: +As an example, assume that you have 1 million vectors with a dimension of 256 and an `nlist` of 128. The memory requirement can be estimated as follows: ```r 1.1 * (((2 * 256) * 1,000,000) + (4 * 128 * 256)) ~= 0.525 GB @@ -310,3 +310,175 @@ For example, assume that you have 1 million vectors with a dimension of 256, `iv ```r 1.1*((8 / 8 * 64 + 24) * 1000000 + 100 * (2^8 * 4 * 256 + 4 * 512 * 256)) ~= 0.171 GB ``` + +## Binary quantization + +Starting with version 2.17, OpenSearch supports BQ with binary vector support for the Faiss engine. BQ compresses vectors into a binary format (0s and 1s), making it highly efficient in terms of memory usage. You can choose to represent each vector dimension using 1, 2, or 4 bits, depending on the desired precision. One of the advantages of using BQ is that the training process is handled automatically during indexing. This means that no separate training step is required, unlike other quantization techniques such as PQ. + +### Using BQ +To configure BQ for the Faiss engine, define a `knn_vector` field and specify the `mode` as `on_disk`. This configuration defaults to 1-bit BQ and both `ef_search` and `ef_construction` set to `100`: + +```json +PUT my-vector-index +{ + "mappings": { + "properties": { + "my_vector_field": { + "type": "knn_vector", + "dimension": 8, + "space_type": "l2", + "data_type": "float", + "mode": "on_disk" + } + } + } +} +``` +{% include copy-curl.html %} + +To further optimize the configuration, you can specify additional parameters, such as the compression level, and fine-tune the search parameters. For example, you can override the `ef_construction` value or define the compression level, which corresponds to the number of bits used for quantization: + +- **32x compression** for 1-bit quantization +- **16x compression** for 2-bit quantization +- **8x compression** for 4-bit quantization + +This allows for greater control over memory usage and recall performance, providing flexibility to balance between precision and storage efficiency. + +To specify the compression level, set the `compression_level` parameter: + +```json +PUT my-vector-index +{ + "mappings": { + "properties": { + "my_vector_field": { + "type": "knn_vector", + "dimension": 8, + "space_type": "l2", + "data_type": "float", + "mode": "on_disk", + "compression_level": "16x", + "method": { + "params": { + "ef_construction": 16 + } + } + } + } + } +} +``` +{% include copy-curl.html %} + +The following example further fine-tunes the configuration by defining `ef_construction`, `encoder`, and the number of `bits` (which can be `1`, `2`, or `4`): + +```json +PUT my-vector-index +{ + "mappings": { + "properties": { + "my_vector_field": { + "type": "knn_vector", + "dimension": 8, + "method": { + "name": "hnsw", + "engine": "faiss", + "space_type": "l2", + "params": { + "m": 16, + "ef_construction": 512, + "encoder": { + "name": "binary", + "parameters": { + "bits": 1 + } + } + } + } + } + } + } +} +``` +{% include copy-curl.html %} + +### Search using binary quantized vectors + +You can perform a k-NN search on your index by providing a vector and specifying the number of nearest neighbors (k) to return: + +```json +GET my-vector-index/_search +{ + "size": 2, + "query": { + "knn": { + "my_vector_field": { + "vector": [1.5, 5.5, 1.5, 5.5, 1.5, 5.5, 1.5, 5.5], + "k": 10 + } + } + } +} +``` +{% include copy-curl.html %} + +You can also fine-tune search by providing the `ef_search` and `oversample_factor` parameters. +The `oversample_factor` parameter controls the factor by which the search oversamples the candidate vectors before ranking them. Using a higher oversample factor means that more candidates will be considered before ranking, improving accuracy but also increasing search time. When selecting the `oversample_factor` value, consider the trade-off between accuracy and efficiency. For example, setting the `oversample_factor` to `2.0` will double the number of candidates considered during the ranking phase, which may help achieve better results. + +The following request specifies the `ef_search` and `oversample_factor` parameters: + +```json +GET my-vector-index/_search +{ + "size": 2, + "query": { + "knn": { + "my_vector_field": { + "vector": [1.5, 5.5, 1.5, 5.5, 1.5, 5.5, 1.5, 5.5], + "k": 10, + "method_parameters": { + "ef_search": 10 + }, + "rescore": { + "oversample_factor": 10.0 + } + } + } + } +} +``` +{% include copy-curl.html %} + + +#### HNSW memory estimation + +The memory required for the Hierarchical Navigable Small World (HNSW) graph can be estimated as `1.1 * (dimension + 8 * m)` bytes/vector, where `m` is the maximum number of bidirectional links created for each element during the construction of the graph. + +As an example, assume that you have 1 million vectors with a dimension of 256 and an `m` of 16. The following sections provide memory requirement estimations for various compression values. + +##### 1-bit quantization (32x compression) + +In 1-bit quantization, each dimension is represented using 1 bit, equivalent to a 32x compression factor. The memory requirement can be estimated as follows: + +```r +Memory = 1.1 * ((256 * 1 / 8) + 8 * 16) * 1,000,000 + ~= 0.176 GB +``` + +##### 2-bit quantization (16x compression) + +In 2-bit quantization, each dimension is represented using 2 bits, equivalent to a 16x compression factor. The memory requirement can be estimated as follows: + +```r +Memory = 1.1 * ((256 * 2 / 8) + 8 * 16) * 1,000,000 + ~= 0.211 GB +``` + +##### 4-bit quantization (8x compression) + +In 4-bit quantization, each dimension is represented using 4 bits, equivalent to an 8x compression factor. The memory requirement can be estimated as follows: + +```r +Memory = 1.1 * ((256 * 4 / 8) + 8 * 16) * 1,000,000 + ~= 0.282 GB +``` diff --git a/_search-plugins/knn/nested-search-knn.md b/_search-plugins/knn/nested-search-knn.md index d947ebc6e66..bbba6c9c1e5 100644 --- a/_search-plugins/knn/nested-search-knn.md +++ b/_search-plugins/knn/nested-search-knn.md @@ -38,9 +38,9 @@ PUT my-knn-index-1 "my_vector": { "type": "knn_vector", "dimension": 3, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "lucene", "parameters": { "ef_construction": 100, @@ -324,9 +324,9 @@ PUT my-knn-index-1 "my_vector": { "type": "knn_vector", "dimension": 3, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "lucene", "parameters": { "ef_construction": 100, diff --git a/_search-plugins/knn/painless-functions.md b/_search-plugins/knn/painless-functions.md index cc27776fc45..7a8d9fec7bd 100644 --- a/_search-plugins/knn/painless-functions.md +++ b/_search-plugins/knn/painless-functions.md @@ -55,7 +55,7 @@ l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This functi cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:
`float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)`
In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score. hamming | `float hamming (float[] queryVector, doc['vector field'])` | This function calculates the Hamming distance between a given query vector and document vectors. The Hamming distance is the number of positions at which the corresponding elements are different. The shorter the distance, the more relevant the document is, so this example inverts the return value of the Hamming distance. -The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors). +The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vectors). {: .note} ## Constraints diff --git a/_search-plugins/knn/performance-tuning.md b/_search-plugins/knn/performance-tuning.md index 123b1daef19..77f44dee93f 100644 --- a/_search-plugins/knn/performance-tuning.md +++ b/_search-plugins/knn/performance-tuning.md @@ -59,9 +59,9 @@ The `_source` field contains the original JSON document body that was passed at "location": { "type": "knn_vector", "dimension": 2, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "faiss" } } @@ -85,9 +85,9 @@ In OpenSearch 2.15 or later, you can further improve indexing speed and reduce d "location": { "type": "knn_vector", "dimension": 2, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "faiss" } } diff --git a/_search-plugins/knn/radial-search-knn.md b/_search-plugins/knn/radial-search-knn.md index 1a4a2232942..e5449a0993a 100644 --- a/_search-plugins/knn/radial-search-knn.md +++ b/_search-plugins/knn/radial-search-knn.md @@ -53,9 +53,9 @@ PUT knn-index-test "my_vector": { "type": "knn_vector", "dimension": 2, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "faiss", "parameters": { "ef_construction": 100, diff --git a/_search-plugins/search-pipelines/using-search-pipeline.md b/_search-plugins/search-pipelines/using-search-pipeline.md index ecb988ad110..b6dbbdc5d0d 100644 --- a/_search-plugins/search-pipelines/using-search-pipeline.md +++ b/_search-plugins/search-pipelines/using-search-pipeline.md @@ -17,14 +17,45 @@ You can use a search pipeline in the following ways: ## Specifying an existing search pipeline for a request -After you [create a search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/creating-search-pipeline/), you can use the pipeline with a query by specifying the pipeline name in the `search_pipeline` query parameter: +After you [create a search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/creating-search-pipeline/), you can use the pipeline with a query in the following ways. For a complete example of using a search pipeline with a `filter_query` processor, see [`filter_query` processor example]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/filter-query-processor#example). + +### Specifying the pipeline in a query parameter + +You can specify the pipeline name in the `search_pipeline` query parameter as follows: ```json GET /my_index/_search?search_pipeline=my_pipeline ``` {% include copy-curl.html %} -For a complete example of using a search pipeline with a `filter_query` processor, see [`filter_query` processor example]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/filter-query-processor#example). +### Specifying the pipeline in the request body + +You can provide a search pipeline ID in the search request body as follows: + +```json +GET /my-index/_search +{ + "query": { + "match_all": {} + }, + "from": 0, + "size": 10, + "search_pipeline": "my_pipeline" +} +``` +{% include copy-curl.html %} + +For multi-search, you can provide a search pipeline ID in the search request body as follows: + +```json +GET /_msearch +{ "index": "test"} +{ "query": { "match_all": {} }, "from": 0, "size": 10, "search_pipeline": "my_pipeline"} +{ "index": "test-1", "search_type": "dfs_query_then_fetch"} +{ "query": { "match_all": {} }, "search_pipeline": "my_pipeline1" } + +``` +{% include copy-curl.html %} ## Using a temporary search pipeline for a request diff --git a/_search-plugins/searching-data/inner-hits.md b/_search-plugins/searching-data/inner-hits.md index 395e9e748ab..5eda9498b5f 100644 --- a/_search-plugins/searching-data/inner-hits.md +++ b/_search-plugins/searching-data/inner-hits.md @@ -806,4 +806,8 @@ The following is the expected result: Using `inner_hits` provides contextual relevance by showing exactly which nested or child documents match the query criteria. This is crucial for applications in which the relevance of results depends on a specific part of the document that matches the query. - Example use case: In a customer support system, you have tickets as parent documents and comments or updates as nested or child documents. You can determine which specific comment matches the search in order to better understand the context of the ticket search. \ No newline at end of file + Example use case: In a customer support system, you have tickets as parent documents and comments or updates as nested or child documents. You can determine which specific comment matches the search in order to better understand the context of the ticket search. + +## Next steps + +- Learn about [joining queries]({{site.url}}{{site.baseurl}}/query-dsl/joining/) on [nested]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/nested/) or [join]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/join/) fields. \ No newline at end of file diff --git a/_search-plugins/vector-search.md b/_search-plugins/vector-search.md index cd893f41441..f19030bf907 100644 --- a/_search-plugins/vector-search.md +++ b/_search-plugins/vector-search.md @@ -37,9 +37,9 @@ PUT test-index "my_vector1": { "type": "knn_vector", "dimension": 1024, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "nmslib", "parameters": { "ef_construction": 128, @@ -57,7 +57,7 @@ PUT test-index You must designate the field that will store vectors as a [`knn_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) field type. OpenSearch supports vectors of up to 16,000 dimensions, each of which is represented as a 32-bit or 16-bit float. -To save storage space, you can use `byte` or `binary` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector) and [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors). +To save storage space, you can use `byte` or `binary` vectors. For more information, see [Byte vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#byte-vectors) and [Binary vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vectors). ### k-NN vector search @@ -131,9 +131,9 @@ PUT /hotels-index "location": { "type": "knn_vector", "dimension": 2, + "space_type": "l2", "method": { "name": "hnsw", - "space_type": "l2", "engine": "lucene", "parameters": { "ef_construction": 100, diff --git a/_security/access-control/document-level-security.md b/_security/access-control/document-level-security.md index 352fe06a619..b17b60e147b 100644 --- a/_security/access-control/document-level-security.md +++ b/_security/access-control/document-level-security.md @@ -13,6 +13,8 @@ Document-level security lets you restrict a role to a subset of documents in an ![Document- and field-level security screen in OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/images/security-dls.png) +The maximum size for the document-level security configuration is 1024 KB (1,048,404 characters). +{: .warning} ## Simple roles diff --git a/_security/audit-logs/index.md b/_security/audit-logs/index.md index becb001ec07..8eeea334473 100644 --- a/_security/audit-logs/index.md +++ b/_security/audit-logs/index.md @@ -224,3 +224,36 @@ plugins.security.audit.config.threadpool.max_queue_len: 100000 To disable audit logs after they've been enabled, remove the `plugins.security.audit.type: internal_opensearch` setting from `opensearch.yml`, or switch off the **Enable audit logging** check box in OpenSearch Dashboards. +## Audit user account manipulation + +To enable audit logging on changes to a security index, such as changes to roles mappings and role creation or deletion, use the following settings in the `compliance:` portion of the audit log configuration, as shown in the following example: + +``` +_meta: + type: "audit" + config_version: 2 + +config: + # enable/disable audit logging + enabled: true + + ... + + + compliance: + # enable/disable compliance + enabled: true + + # Log updates to internal security changes + internal_config: true + + # Log only metadata of the document for write events + write_metadata_only: false + + # Log only diffs for document updates + write_log_diffs: true + + # List of indices to watch for write events. Wildcard patterns are supported + # write_watched_indices: ["twitter", "logs-*"] + write_watched_indices: [".opendistro_security"] +``` diff --git a/_security/authentication-backends/jwt.md b/_security/authentication-backends/jwt.md index 3f28dfecfd9..6c7311e7dc4 100644 --- a/_security/authentication-backends/jwt.md +++ b/_security/authentication-backends/jwt.md @@ -117,7 +117,7 @@ The following table lists the configuration parameters. Name | Description :--- | :--- -`signing_key` | The signing key to use when verifying the token. If you use a symmetric key algorithm, it is the base64-encoded shared secret. If you use an asymmetric algorithm, it contains the public key. +`signing_key` | The signing key(s) used to verify the token. If you use a symmetric key algorithm, this is the Base64-encoded shared secret. If you use an asymmetric algorithm, the algorithm contains the public key. To pass multiple keys, use a comma-separated list or enumerate the keys. `jwt_header` | The HTTP header in which the token is transmitted. This is typically the `Authorization` header with the `Bearer` schema,`Authorization: Bearer `. Default is `Authorization`. Replacing this field with a value other than `Authorization` prevents the audit log from properly redacting the JWT header from audit messages. It is recommended that users only use `Authorization` when using JWTs with audit logging. `jwt_url_parameter` | If the token is not transmitted in the HTTP header but rather as an URL parameter, define the name of the parameter here. `subject_key` | The key in the JSON payload that stores the username. If not set, the [subject](https://tools.ietf.org/html/rfc7519#section-4.1.2) registered claim is used. diff --git a/_security/configuration/disable-enable-security.md b/_security/configuration/disable-enable-security.md index 811fd2a69fd..38bcc01cdd2 100755 --- a/_security/configuration/disable-enable-security.md +++ b/_security/configuration/disable-enable-security.md @@ -155,22 +155,22 @@ Use the following steps to reinstall the plugin: 1. Disable shard allocation and stop all nodes so that shards don't move when the cluster is restarted: - ```json - curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{ - "transient": { - "cluster.routing.allocation.enable": "none" - } - }' - ``` - {% include copy.html %} + ```json + curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{ + "transient": { + "cluster.routing.allocation.enable": "none" + } + }' + ``` + {% include copy.html %} 2. Install the Security plugin on all nodes in your cluster using one of the [installation methods]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#install): - ```bash - bin/opensearch-plugin install opensearch-security - ``` - {% include copy.html %} - + ```bash + bin/opensearch-plugin install opensearch-security + ``` + {% include copy.html %} + 3. Add the necessary configuration to `opensearch.yml` for TLS encryption. See [Configuration]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/security-settings/) for information about the settings that need to be configured. diff --git a/_security/configuration/index.md b/_security/configuration/index.md index 31292c320a7..f68667d92df 100644 --- a/_security/configuration/index.md +++ b/_security/configuration/index.md @@ -3,7 +3,7 @@ layout: default title: Configuration nav_order: 2 has_children: true -has_toc: false +has_toc: true redirect_from: - /security-plugin/configuration/ - /security-plugin/configuration/index/ @@ -11,21 +11,105 @@ redirect_from: # Security configuration -The plugin includes demo certificates so that you can get up and running quickly. To use OpenSearch in a production environment, you must configure it manually: +The Security plugin includes demo certificates so that you can get up and running quickly. To use OpenSearch with the Security plugin in a production environment, you must make changes to the demo certificates and other configuration options manually. -1. [Replace the demo certificates]({{site.url}}{{site.baseurl}}/install-and-configure/install-opensearch/docker/#configuring-basic-security-settings). -1. [Reconfigure `opensearch.yml` to use your certificates]({{site.url}}{{site.baseurl}}/security/configuration/tls). -1. [Reconfigure `config.yml` to use your authentication backend]({{site.url}}{{site.baseurl}}/security/configuration/configuration/) (if you don't plan to use the internal user database). -1. [Modify the configuration YAML files]({{site.url}}{{site.baseurl}}/security/configuration/yaml). -1. If you plan to use the internal user database, [set a password policy in `opensearch.yml`]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#opensearchyml). -1. [Apply changes using the `securityadmin` script]({{site.url}}{{site.baseurl}}/security/configuration/security-admin). -1. Start OpenSearch. -1. [Add users, roles, role mappings, and tenants]({{site.url}}{{site.baseurl}}/security/access-control/index/). +## Replace the demo certificates -If you don't want to use the plugin, see [Disable security]({{site.url}}{{site.baseurl}}/security/configuration/disable-enable-security/). +OpenSearch ships with demo certificates intended for quick setup and demonstration purposes. For a production environment, it's critical to replace these with your own trusted certificates, using the following steps, to ensure secure communication: -The Security plugin has several default users, roles, action groups, permissions, and settings for OpenSearch Dashboards that use kibana in their names. We will change these names in a future release. +1. **Generate your own certificates:** Use tools like OpenSSL or a certificate authority (CA) to generate your own certificates. For more information about generating certificates with OpenSSL, see [Generating self-signed certificates]({{site.url}}{{site.baseurl}}/security/configuration/generate-certificates/). +2. **Store the generated certificates and private key in the appropriate directory:** Generated certificates are typically stored in `/config/`. For more information, see [Add certificate files to opensearch.yml]({{site.url}}{{site.baseurl}}/security/configuration/generate-certificates/#add-certificate-files-to-opensearchyml). +3. **Set the following file permissions:** + - Private key (.key files): Set the file mode to `600`. This restricts access so that only the file owner (the OpenSearch user) can read and write to the file, ensuring that the private key remains secure and inaccessible to unauthorized users. + - Public certificates (.crt, .pem files): Set the file mode to `644`. This allows the file owner to read and write to the file, while other users can only read it. + +For additional guidance on file modes, see the following table. + + | Item | Sample | Numeric | Bitwise | + |-------------|---------------------|---------|--------------| + | Public key | `~/.ssh/id_rsa.pub` | `644` | `-rw-r--r--` | + | Private key | `~/.ssh/id_rsa` | `600` | `-rw-------` | + | SSH folder | `~/.ssh` | `700` | `drwx------` | + +For more information, see [Configuring basic security settings]({{site.url}}{{site.baseurl}}/install-and-configure/install-opensearch/docker/#configuring-basic-security-settings). + +## Reconfigure `opensearch.yml` to use your certificates + +The `opensearch.yml` file is the main configuration file for OpenSearch; you can find the file at `/config/opensearch.yml`. Use the following steps to update this file to point to your custom certificates: + +In `opensearch.yml`, set the correct paths for your certificates and keys, as shown in the following example: + ``` + plugins.security.ssl.transport.pemcert_filepath: /path/to/your/cert.pem + plugins.security.ssl.transport.pemkey_filepath: /path/to/your/key.pem + plugins.security.ssl.transport.pemtrustedcas_filepath: /path/to/your/ca.pem + plugins.security.ssl.http.enabled: true + plugins.security.ssl.http.pemcert_filepath: /path/to/your/cert.pem + plugins.security.ssl.http.pemkey_filepath: /path/to/your/key.pem + plugins.security.ssl.http.pemtrustedcas_filepath: /path/to/your/ca.pem + ``` +For more information, see [Configuring TLS certificates]({{site.url}}{{site.baseurl}}/security/configuration/tls/). + +## Reconfigure `config.yml` to use your authentication backend + +The `config.yml` file allows you to configure the authentication and authorization mechanisms for OpenSearch. Update the authentication backend settings in `/config/opensearch-security/config.yml` according to your requirements. + +For example, to use LDAP as your authentication backend, add the following settings: + + ``` + authc: + basic_internal_auth: + http_enabled: true + transport_enabled: true + order: 1 + http_authenticator: + type: basic + challenge: true + authentication_backend: + type: internal + ``` +For more information, see [Configuring the Security backend]({{site.url}}{{site.baseurl}}/security/configuration/configuration/). + +## Modify the configuration YAML files + +Determine whether any additional YAML files need modification, for example, the `roles.yml`, `roles_mapping.yml`, or `internal_users.yml` files. Update the files with any additional configuration information. For more information, see [Modifying the YAML files]({{site.url}}{{site.baseurl}}/security/configuration/yaml/). + +## Set a password policy + +When using the internal user database, we recommend enforcing a password policy to ensure that strong passwords are used. For information about strong password policies, see [Password settings]({{site.url}}{{site.baseurl}}/security/configuration/yaml/#password-settings). + +## Apply changes using the `securityadmin` script + +The following steps do not apply to first-time users because the security index is automatically initialized from the YAML configuration files when OpenSearch starts. +{: .note} + +After initial setup, if you make changes to your security configuration or disable automatic initialization by setting `plugins.security.allow_default_init_securityindex` to `false` (which prevents security index initialization from `yaml` files), you need to manually apply changes using the `securityadmin` script: + +1. Find the `securityadmin` script. The script is typically stored in the OpenSearch plugins directory, `plugins/opensearch-security/tools/securityadmin.[sh|bat]`. + - Note: If you're using OpenSearch 1.x, the `securityadmin` script is located in the `plugins/opendistro_security/tools/` directory. + - For more information, see [Basic usage](https://opensearch.org/docs/latest/security/configuration/security-admin/#basic-usage). +2. Run the script by using the following command: + ``` + ./plugins/opensearch-security/tools/securityadmin.[sh|bat] + ``` +3. Check the OpenSearch logs and configuration to ensure that the changes have been successfully applied. + +For more information about using the `securityadmin` script, see [Applying changes to configuration files]({{site.url}}{{site.baseurl}}/security/configuration/security-admin/). + +## Add users, roles, role mappings, and tenants + +If you don't want to use the Security plugin, you can disable it by adding the following setting to the `opensearch.yml` file: + +``` +plugins.security.disabled: true +``` + +You can then enable the plugin by removing the `plugins.security.disabled` setting. + +For more information about disabling the Security plugin, see [Disable security]({{site.url}}{{site.baseurl}}/security/configuration/disable-enable-security/). + +The Security plugin has several default users, roles, action groups, permissions, and settings for OpenSearch Dashboards that contain "Kibana" in their names. We will change these names in a future version. {: .note } -For a full list of `opensearch.yml` Security plugin settings, Security plugin settings, see [Security settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/security-settings/). -{: .note} \ No newline at end of file +For a full list of `opensearch.yml` Security plugin settings, see [Security settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/security-settings/). +{: .note} + diff --git a/_security/configuration/security-admin.md b/_security/configuration/security-admin.md index a03d30fd034..b4d23dce5b5 100755 --- a/_security/configuration/security-admin.md +++ b/_security/configuration/security-admin.md @@ -23,13 +23,13 @@ The `securityadmin.sh` script requires SSL/TLS HTTP to be enabled for your OpenS ## A word of caution -If you make changes to the configuration files in `config/opensearch-security`, OpenSearch does _not_ automatically apply these changes. Instead, you must run `securityadmin.sh` to load the updated files into the index. +If you make changes to the configuration files in `config/opensearch-security`, OpenSearch does _not_ automatically apply these changes. Instead, you must run `securityadmin.sh` to load the updated files into the index. The `securityadmin.sh` file can be found in `/plugins/opensearch-security/tools/securityadmin.[sh|bat]`. Running `securityadmin.sh` **overwrites** one or more portions of the `.opendistro_security` index. Run it with extreme care to avoid losing your existing resources. Consider the following example: 1. You initialize the `.opendistro_security` index. 1. You create ten users using the REST API. -1. You decide to create a new [reserved user]({{site.url}}{{site.baseurl}}/security/access-control/api/#reserved-and-hidden-resources) using `internal_users.yml`. +1. You decide to create a new [reserved user]({{site.url}}{{site.baseurl}}/security/access-control/api/#reserved-and-hidden-resources) using `internal_users.yml`, found in `/config/opensearch-security/` directory. 1. You run `securityadmin.sh` again to load the new reserved user into the index. 1. You lose all ten users that you created using the REST API. diff --git a/_security/configuration/yaml.md b/_security/configuration/yaml.md index 3aabce53d53..2694e3a24fc 100644 --- a/_security/configuration/yaml.md +++ b/_security/configuration/yaml.md @@ -15,10 +15,86 @@ Before running [`securityadmin.sh`]({{site.url}}{{site.baseurl}}/security/config The approach we recommend for using the YAML files is to first configure [reserved and hidden resources]({{site.url}}{{site.baseurl}}/security/access-control/api#reserved-and-hidden-resources), such as the `admin` and `kibanaserver` users. Thereafter you can create other users, roles, mappings, action groups, and tenants using OpenSearch Dashboards or the REST API. +## action_groups.yml + +This file contains any role mappings required for your security configuration. You can find the `role_mapping.yml` file in `/config/opensearch-security/roles_mapping.yml`. + +Aside from some metadata, the default file is empty, because the Security plugin has a number of static action groups that it adds automatically. These static action groups cover a wide variety of use cases and are a great way to get started with the plugin. + +```yml +--- +my-action-group: + reserved: false + hidden: false + allowed_actions: + - "indices:data/write/index*" + - "indices:data/write/update*" + - "indices:admin/mapping/put" + - "indices:data/write/bulk*" + - "read" + - "write" + static: false +_meta: + type: "actiongroups" + config_version: 2 +``` + +## allowlist.yml + +You can use `allowlist.yml` to add any endpoints and HTTP requests to a list of allowed endpoints and requests. If enabled, all users except the super admin are allowed access to only the specified endpoints and HTTP requests, and all other HTTP requests associated with the endpoint are denied. For example, if GET `_cluster/settings` is added to the allow list, users cannot submit PUT requests to `_cluster/settings` to update cluster settings. + +You can find the `allowlist.yml` file in `/config/opensearch-security/allowlist.yml`. + +Note that while you can configure access to endpoints this way, for most cases, it is still best to configure permissions using the Security plugin's users and roles, which have more granular settings. + +```yml +--- +_meta: + type: "allowlist" + config_version: 2 + +# Description: +# enabled - feature flag. +# if enabled is false, all endpoints are accessible. +# if enabled is true, all users except the SuperAdmin can only submit the allowed requests to the specified endpoints. +# SuperAdmin can access all APIs. +# SuperAdmin is defined by the SuperAdmin certificate, which is configured with the opensearch.yml setting plugins.security.authcz.admin_dn: +# Refer to the example setting in opensearch.yml to learn more about configuring SuperAdmin. +# +# requests - map of allow listed endpoints and HTTP requests + +#this name must be config +config: + enabled: true + requests: + /_cluster/settings: + - GET + /_cat/nodes: + - GET +``` + +To enable PUT requests to cluster settings, add PUT to the list of allowed operations under `/_cluster/settings`. + +```yml +requests: + /_cluster/settings: + - GET + - PUT +``` + +You can also add custom indexes to the allow list. `allowlist.yml` doesn't support wildcards, so you must manually specify all of the indexes you want to add. + +```yml +requests: # Only allow GET requests to /sample-index1/_doc/1 and /sample-index2/_doc/1 + /sample-index1/_doc/1: + - GET + /sample-index2/_doc/1: + - GET +``` ## internal_users.yml -This file contains any initial users that you want to add to the Security plugin's internal user database. +This file contains any initial users that you want to add to the Security plugin's internal user database. You can find this file in ``/config/opensearch-security/internal_users.yml`. The file format requires a hashed password. To generate one, run `plugins/opensearch-security/tools/hash.sh -p `. If you decide to keep any of the demo users, *change their passwords* and re-run [securityadmin.sh]({{site.url}}{{site.baseurl}}/security/configuration/security-admin/) to apply the new passwords. @@ -92,196 +168,24 @@ snapshotrestore: description: "Demo snapshotrestore user" ``` -## opensearch.yml - -In addition to many OpenSearch settings, this file contains paths to TLS certificates and their attributes, such as distinguished names and trusted certificate authorities. - -```yml -plugins.security.ssl.transport.pemcert_filepath: esnode.pem -plugins.security.ssl.transport.pemkey_filepath: esnode-key.pem -plugins.security.ssl.transport.pemtrustedcas_filepath: root-ca.pem -plugins.security.ssl.transport.enforce_hostname_verification: false -plugins.security.ssl.http.enabled: true -plugins.security.ssl.http.pemcert_filepath: esnode.pem -plugins.security.ssl.http.pemkey_filepath: esnode-key.pem -plugins.security.ssl.http.pemtrustedcas_filepath: root-ca.pem -plugins.security.allow_unsafe_democertificates: true -plugins.security.allow_default_init_securityindex: true -plugins.security.authcz.admin_dn: - - CN=kirk,OU=client,O=client,L=test, C=de - -plugins.security.audit.type: internal_opensearch -plugins.security.enable_snapshot_restore_privilege: true -plugins.security.check_snapshot_restore_write_privileges: true -plugins.security.cache.ttl_minutes: 60 -plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"] -plugins.security.system_indices.enabled: true -plugins.security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*"] -node.max_local_storage_nodes: 3 -``` - -For a full list of `opensearch.yml` Security plugin settings, see [Security settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/security-settings/). -{: .note} - -### Refining your configuration - -The `plugins.security.allow_default_init_securityindex` setting, when set to `true`, sets the Security plugin to its default security settings if an attempt to create the security index fails when OpenSearch launches. Default security settings are stored in YAML files contained in the `opensearch-project/security/config` directory. By default, this setting is `false`. - -```yml -plugins.security.allow_default_init_securityindex: true -``` - -An authentication cache for the Security plugin exists to help speed up authentication by temporarily storing user objects returned from the backend so that the Security plugin is not required to make repeated requests for them. To determine how long it takes for caching to time out, you can use the `plugins.security.cache.ttl_minutes` property to set a value in minutes. The default is `60`. You can disable caching by setting the value to `0`. - -```yml -plugins.security.cache.ttl_minutes: 60 -``` - -### Enabling user access to system indexes - -Mapping a system index permission to a user allows that user to modify the system index specified in the permission's name (the one exception is the Security plugin's [system index]({{site.url}}{{site.baseurl}}/security/configuration/system-indices/)). The `plugins.security.system_indices.permission.enabled` setting provides a way for administrators to make this permission available for or hidden from role mapping. - -When set to `true`, the feature is enabled and users with permission to modify roles can create roles that include permissions that grant access to system indexes: - -```yml -plugins.security.system_indices.permission.enabled: true -``` - -When set to `false`, the permission is disabled and only admins with an admin certificate can make changes to system indexes. By default, the permission is set to `false` in a new cluster. - -To learn more about system index permissions, see [System index permissions]({{site.url}}{{site.baseurl}}/security/access-control/permissions/#system-index-permissions). - - -### Password settings - -If you want to run your users' passwords against some validation, specify a regular expression (regex) in this file. You can also include an error message that loads when passwords don't pass validation. The following example demonstrates how to include a regex so OpenSearch requires new passwords to be a minimum of eight characters with at least one uppercase, one lowercase, one digit, and one special character. - -Note that OpenSearch validates only users and passwords created through OpenSearch Dashboards or the REST API. - -```yml -plugins.security.restapi.password_validation_regex: '(?=.*[A-Z])(?=.*[^a-zA-Z\d])(?=.*[0-9])(?=.*[a-z]).{8,}' -plugins.security.restapi.password_validation_error_message: "Password must be minimum 8 characters long and must contain at least one uppercase letter, one lowercase letter, one digit, and one special character." -``` - -In addition, a score-based password strength estimator allows you to set a threshold for password strength when creating a new internal user or updating a user's password. This feature makes use of the [zxcvbn library](https://github.com/dropbox/zxcvbn) to apply a policy that emphasizes a password's complexity rather than its capacity to meet traditional criteria such as uppercase keys, numerals, and special characters. - -For information about defining users, see [Defining users]({{site.url}}{{site.baseurl}}/security/access-control/users-roles/#defining-users). - -This feature is not compatible with users specified as reserved. For information about reserved resources, see [Reserved and hidden resources]({{site.url}}{{site.baseurl}}/security/access-control/api#reserved-and-hidden-resources). -{: .important } - -Score-based password strength requires two settings to configure the feature. The following table describes the two settings. - -| Setting | Description | -| :--- | :--- | -| `plugins.security.restapi.password_min_length` | Sets the minimum number of characters for the password length. The default is `8`. This is also the minimum. | -| `plugins.security.restapi.password_score_based_validation_strength` | Sets a threshold to determine whether the password is strong or weak. There are four values that represent a threshold's increasing complexity.
`fair`--A very "guessable" password: provides protection from throttled online attacks.
`good`--A somewhat guessable password: provides protection from unthrottled online attacks.
`strong`--A safely "unguessable" password: provides moderate protection from an offline, slow-hash scenario.
`very_strong`--A very unguessable password: provides strong protection from an offline, slow-hash scenario. | - -The following example shows the settings configured for the `opensearch.yml` file and enabling a password with a minimum of 10 characters and a threshold requiring the highest strength: - -```yml -plugins.security.restapi.password_min_length: 10 -plugins.security.restapi.password_score_based_validation_strength: very_strong -``` - -When you try to create a user with a password that doesn't reach the specified threshold, the system generates a "weak password" warning, indicating that the password needs to be modified before you can save the user. - -The following example shows the response from the [Create user]({{site.url}}{{site.baseurl}}/security/access-control/api/#create-user) API when the password is weak: - -```json -{ - "status": "error", - "reason": "Weak password" -} -``` - -## allowlist.yml +## nodes_dn.yml -You can use `allowlist.yml` to add any endpoints and HTTP requests to a list of allowed endpoints and requests. If enabled, all users except the super admin are allowed access to only the specified endpoints and HTTP requests, and all other HTTP requests associated with the endpoint are denied. For example, if GET `_cluster/settings` is added to the allow list, users cannot submit PUT requests to `_cluster/settings` to update cluster settings. +`nodes_dn.yml` lets you add certificates' [distinguished names (DNs)]({{site.url}}{{site.baseurl}}/security/configuration/generate-certificates/#add-distinguished-names-to-opensearchyml) to an allow list to enable communication between any number of nodes or clusters. For example, a node that has the DN `CN=node1.example.com` in its allow list accepts communication from any other node or certificate that uses that DN. -Note that while you can configure access to endpoints this way, for most cases, it is still best to configure permissions using the Security plugin's users and roles, which have more granular settings. +The DNs get indexed into a [system index]({{site.url}}{{site.baseurl}}/security/configuration/system-indices) that only a super admin or an admin with a Transport Layer Security (TLS) certificate can access. If you want to programmatically add DNs to your allow lists, use the [REST API]({{site.url}}{{site.baseurl}}/security/access-control/api/#distinguished-names). ```yml --- _meta: - type: "allowlist" + type: "nodesdn" config_version: 2 -# Description: -# enabled - feature flag. -# if enabled is false, all endpoints are accessible. -# if enabled is true, all users except the SuperAdmin can only submit the allowed requests to the specified endpoints. -# SuperAdmin can access all APIs. -# SuperAdmin is defined by the SuperAdmin certificate, which is configured with the opensearch.yml setting plugins.security.authcz.admin_dn: -# Refer to the example setting in opensearch.yml to learn more about configuring SuperAdmin. -# -# requests - map of allow listed endpoints and HTTP requests - -#this name must be config -config: - enabled: true - requests: - /_cluster/settings: - - GET - /_cat/nodes: - - GET -``` - -To enable PUT requests to cluster settings, add PUT to the list of allowed operations under `/_cluster/settings`. - -```yml -requests: - /_cluster/settings: - - GET - - PUT -``` - -You can also add custom indexes to the allow list. `allowlist.yml` doesn't support wildcards, so you must manually specify all of the indexes you want to add. - -```yml -requests: # Only allow GET requests to /sample-index1/_doc/1 and /sample-index2/_doc/1 - /sample-index1/_doc/1: - - GET - /sample-index2/_doc/1: - - GET -``` - - -## roles.yml - -This file contains any initial roles that you want to add to the Security plugin. Aside from some metadata, the default file is empty, because the Security plugin has a number of static roles that it adds automatically. - -```yml ---- -complex-role: - reserved: false - hidden: false - cluster_permissions: - - "read" - - "cluster:monitor/nodes/stats" - - "cluster:monitor/task/get" - index_permissions: - - index_patterns: - - "opensearch_dashboards_sample_data_*" - dls: "{\"match\": {\"FlightDelay\": true}}" - fls: - - "~FlightNum" - masked_fields: - - "Carrier" - allowed_actions: - - "read" - tenant_permissions: - - tenant_patterns: - - "analyst_*" - allowed_actions: - - "kibana_all_write" - static: false -_meta: - type: "roles" - config_version: 2 +# Define nodesdn mapping name and corresponding values +# cluster1: +# nodes_dn: +# - CN=*.example.com ``` - ## roles_mapping.yml ```yml @@ -359,28 +263,37 @@ kibana_server: and_backend_roles: [] ``` +## roles.yml -## action_groups.yml - -This file contains any initial action groups that you want to add to the Security plugin. - -Aside from some metadata, the default file is empty, because the Security plugin has a number of static action groups that it adds automatically. These static action groups cover a wide variety of use cases and are a great way to get started with the plugin. +This file contains any initial roles that you want to add to the Security plugin. By default, this file contains predefined roles that grant usage to plugins within the default distribution of OpenSearch. The Security plugin will also add a number static roles automatically. ```yml --- -my-action-group: +complex-role: reserved: false hidden: false - allowed_actions: - - "indices:data/write/index*" - - "indices:data/write/update*" - - "indices:admin/mapping/put" - - "indices:data/write/bulk*" + cluster_permissions: - "read" - - "write" + - "cluster:monitor/nodes/stats" + - "cluster:monitor/task/get" + index_permissions: + - index_patterns: + - "opensearch_dashboards_sample_data_*" + dls: "{\"match\": {\"FlightDelay\": true}}" + fls: + - "~FlightNum" + masked_fields: + - "Carrier" + allowed_actions: + - "read" + tenant_permissions: + - tenant_patterns: + - "analyst_*" + allowed_actions: + - "kibana_all_write" static: false _meta: - type: "actiongroups" + type: "roles" config_version: 2 ``` @@ -400,20 +313,105 @@ admin_tenant: description: "Demo tenant for admin user" ``` -## nodes_dn.yml +## opensearch.yml -`nodes_dn.yml` lets you add certificates' [distinguished names (DNs)]({{site.url}}{{site.baseurl}}/security/configuration/generate-certificates/#add-distinguished-names-to-opensearchyml) an allow list to enable communication between any number of nodes and/or clusters. For example, a node that has the DN `CN=node1.example.com` in its allow list accepts communication from any other node or certificate that uses that DN. +In addition to many OpenSearch settings, the `opensearch.yml` file contains paths to TLS certificates and their attributes, such as distinguished names and trusted certificate authorities. You can find this file in `/config/`. -The DNs get indexed into a [system index]({{site.url}}{{site.baseurl}}/security/configuration/system-indices) that only a super admin or an admin with a Transport Layer Security (TLS) certificate can access. If you want to programmatically add DNs to your allow lists, use the [REST API]({{site.url}}{{site.baseurl}}/security/access-control/api/#distinguished-names). +```yml +plugins.security.ssl.transport.pemcert_filepath: esnode.pem +plugins.security.ssl.transport.pemkey_filepath: esnode-key.pem +plugins.security.ssl.transport.pemtrustedcas_filepath: root-ca.pem +plugins.security.ssl.transport.enforce_hostname_verification: false +plugins.security.ssl.http.enabled: true +plugins.security.ssl.http.pemcert_filepath: esnode.pem +plugins.security.ssl.http.pemkey_filepath: esnode-key.pem +plugins.security.ssl.http.pemtrustedcas_filepath: root-ca.pem +plugins.security.allow_unsafe_democertificates: true +plugins.security.allow_default_init_securityindex: true +plugins.security.authcz.admin_dn: + - CN=kirk,OU=client,O=client,L=test, C=de + +plugins.security.audit.type: internal_opensearch +plugins.security.enable_snapshot_restore_privilege: true +plugins.security.check_snapshot_restore_write_privileges: true +plugins.security.cache.ttl_minutes: 60 +plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"] +plugins.security.system_indices.enabled: true +plugins.security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*"] +node.max_local_storage_nodes: 3 +``` + +For a full list of `opensearch.yml` Security plugin settings, see [Security settings]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/security-settings/). +{: .note} + +### Refining your configuration + +The `plugins.security.allow_default_init_securityindex` setting, when set to `true`, sets the Security plugin to its default security settings if an attempt to create the security index fails when OpenSearch launches. Default security settings are stored in YAML files contained in the `opensearch-project/security/config` directory. By default, this setting is `false`. ```yml ---- -_meta: - type: "nodesdn" - config_version: 2 +plugins.security.allow_default_init_securityindex: true +``` -# Define nodesdn mapping name and corresponding values -# cluster1: -# nodes_dn: -# - CN=*.example.com +An authentication cache for the Security plugin exists to help speed up authentication by temporarily storing user objects returned from the backend so that the Security plugin is not required to make repeated requests for them. To determine how long it takes for caching to time out, you can use the `plugins.security.cache.ttl_minutes` property to set a value in minutes. The default is `60`. You can disable caching by setting the value to `0`. + +```yml +plugins.security.cache.ttl_minutes: 60 +``` + +### Enabling user access to system indexes + +Mapping a system index permission to a user allows that user to modify the system index specified in the permission's name (the one exception is the Security plugin's [system index]({{site.url}}{{site.baseurl}}/security/configuration/system-indices/)). The `plugins.security.system_indices.permission.enabled` setting provides a way for administrators to make this permission available for or hidden from role mapping. + +When set to `true`, the feature is enabled and users with permission to modify roles can create roles that include permissions that grant access to system indexes: + +```yml +plugins.security.system_indices.permission.enabled: true +``` + +When set to `false`, the permission is disabled and only admins with an admin certificate can make changes to system indexes. By default, the permission is set to `false` in a new cluster. + +To learn more about system index permissions, see [System index permissions]({{site.url}}{{site.baseurl}}/security/access-control/permissions/#system-index-permissions). + + +### Password settings + +If you want to run your users' passwords against some validation, specify a regular expression (regex) in this file. You can also include an error message that loads when passwords don't pass validation. The following example demonstrates how to include a regex so OpenSearch requires new passwords to be a minimum of eight characters with at least one uppercase, one lowercase, one digit, and one special character. + +Note that OpenSearch validates only users and passwords created through OpenSearch Dashboards or the REST API. + +```yml +plugins.security.restapi.password_validation_regex: '(?=.*[A-Z])(?=.*[^a-zA-Z\d])(?=.*[0-9])(?=.*[a-z]).{8,}' +plugins.security.restapi.password_validation_error_message: "Password must be minimum 8 characters long and must contain at least one uppercase letter, one lowercase letter, one digit, and one special character." +``` + +In addition, a score-based password strength estimator allows you to set a threshold for password strength when creating a new internal user or updating a user's password. This feature makes use of the [zxcvbn library](https://github.com/dropbox/zxcvbn) to apply a policy that emphasizes a password's complexity rather than its capacity to meet traditional criteria such as uppercase keys, numerals, and special characters. + +For information about defining users, see [Defining users]({{site.url}}{{site.baseurl}}/security/access-control/users-roles/#defining-users). + +This feature is not compatible with users specified as reserved. For information about reserved resources, see [Reserved and hidden resources]({{site.url}}{{site.baseurl}}/security/access-control/api#reserved-and-hidden-resources). +{: .important } + +Score-based password strength requires two settings to configure the feature. The following table describes the two settings. + +| Setting | Description | +| :--- | :--- | +| `plugins.security.restapi.password_min_length` | Sets the minimum number of characters for the password length. The default is `8`. This is also the minimum. | +| `plugins.security.restapi.password_score_based_validation_strength` | Sets a threshold to determine whether the password is strong or weak. There are four values that represent a threshold's increasing complexity.
`fair`--A very "guessable" password: provides protection from throttled online attacks.
`good`--A somewhat guessable password: provides protection from unthrottled online attacks.
`strong`--A safely "unguessable" password: provides moderate protection from an offline, slow-hash scenario.
`very_strong`--A very unguessable password: provides strong protection from an offline, slow-hash scenario. | + +The following example shows the settings configured for the `opensearch.yml` file and enabling a password with a minimum of 10 characters and a threshold requiring the highest strength: + +```yml +plugins.security.restapi.password_min_length: 10 +plugins.security.restapi.password_score_based_validation_strength: very_strong +``` + +When you try to create a user with a password that doesn't reach the specified threshold, the system generates a "weak password" warning, indicating that the password needs to be modified before you can save the user. + +The following example shows the response from the [Create user]({{site.url}}{{site.baseurl}}/security/access-control/api/#create-user) API when the password is weak: + +```json +{ + "status": "error", + "reason": "Weak password" +} ``` diff --git a/_tools/index.md b/_tools/index.md index 108f10da97d..c9d446a81a6 100644 --- a/_tools/index.md +++ b/_tools/index.md @@ -18,6 +18,7 @@ This section provides documentation for OpenSearch-supported tools, including: - [OpenSearch CLI](#opensearch-cli) - [OpenSearch Kubernetes operator](#opensearch-kubernetes-operator) - [OpenSearch upgrade, migration, and comparison tools](#opensearch-upgrade-migration-and-comparison-tools) +- [Sycamore](#sycamore) for AI-powered extract, transform, load (ETL) on complex documents for vector and hybrid search For information about Data Prepper, the server-side data collector for filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization, see [Data Prepper]({{site.url}}{{site.baseurl}}/data-prepper/index/). @@ -122,3 +123,9 @@ The OpenSearch Kubernetes Operator is an open-source Kubernetes operator that he OpenSearch migration tools facilitate migrations to OpenSearch and upgrades to newer versions of OpenSearch. These can help you can set up a proof-of-concept environment locally using Docker containers or deploy to AWS using a one-click deployment script. This empowers you to fine-tune cluster configurations and manage workloads more effectively before migration. For more information about OpenSearch migration tools, see the documentation in the [OpenSearch Migration GitHub repository](https://github.com/opensearch-project/opensearch-migrations/tree/capture-and-replay-v0.1.0). + +## Sycamore + +[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using an [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html). + +For more information, see [Sycamore]({{site.url}}{{site.baseurl}}/tools/sycamore/). diff --git a/_tools/sycamore.md b/_tools/sycamore.md new file mode 100644 index 00000000000..9b3986dbf32 --- /dev/null +++ b/_tools/sycamore.md @@ -0,0 +1,48 @@ +--- +layout: default +title: Sycamore +nav_order: 210 +has_children: false +--- + +# Sycamore + +[Sycamore](https://github.com/aryn-ai/sycamore) is an open-source, AI-powered document processing engine designed to prepare unstructured data for retrieval-augmented generation (RAG) and semantic search using Python. Sycamore supports chunking and enriching a wide range of complex document types, including reports, presentations, transcripts, and manuals. Additionally, Sycamore can extract and process embedded elements, such as tables, figures, graphs, and other infographics. It can then load the data into target indexes, including vector and keyword indexes, using a connector like the [OpenSearch connector](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html). + +To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html). + +## Sycamore ETL pipeline structure + +A Sycamore extract, transform, load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes. + +A typical pipeline for preparing unstructured data for vector or hybrid search in OpenSearch consists of the following steps: + +* Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets). +* [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements. +* Extract metadata and filter and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html). +* Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements. +* Embed the chunks using the model of your choice. +* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) the embeddings, metadata, and text into OpenSearch vector and keyword indexes. + +For an example pipeline that uses this workflow, see [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb). + + +## Install Sycamore + +We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be specified and installed using extras. For example: + +```bash +pip install sycamore-ai[opensearch] +``` +{% include copy.html %} + +By default, Sycamore works with the Aryn Partitioning Service to process PDFs. To run inference locally for partitioning or embedding, install Sycamore with the `local-inference` extra as follows: + +```bash +pip install sycamore-ai[opensearch,local-inference] +``` +{% include copy.html %} + +## Next steps + +For more information, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html). diff --git a/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md b/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md index d967aca9142..03cd1716f0c 100644 --- a/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md +++ b/_tuning-your-cluster/availability-and-recovery/remote-store/remote-cluster-state.md @@ -67,10 +67,14 @@ The remote cluster state functionality has the following limitations: ## Remote cluster state publication - The cluster manager node processes updates to the cluster state. It then publishes the updated cluster state through the local transport layer to all of the follower nodes. With the `remote_store.publication` feature enabled, the cluster state is backed up to the remote store during every state update. The follower nodes can then fetch the state from the remote store directly, which reduces the overhead on the cluster manager node for publication. -To enable the feature flag for the `remote_store.publication` feature, follow the steps in the [experimental feature flag documentation]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/experimental/). +To enable this feature, configure the following setting in `opensearch.yml`: + +```yml +# Enable Remote cluster state publication +cluster.remote_store.publication.enabled: true +``` Enabling the setting does not change the publication flow, and follower nodes will not send acknowledgements back to the cluster manager node until they download the updated cluster state from the remote store. @@ -89,8 +93,11 @@ You do not have to use different remote store repositories for state and routing To configure remote publication, use the following cluster settings. -Setting | Default | Description -:--- | :--- | :--- -`cluster.remote_store.state.read_timeout` | 20s | The amount of time to wait for remote state download to complete on the follower node. -`cluster.remote_store.routing_table.path_type` | HASHED_PREFIX | The path type to be used for creating an index routing path in the blob store. Valid values are `FIXED`, `HASHED_PREFIX`, and `HASHED_INFIX`. -`cluster.remote_store.routing_table.path_hash_algo` | FNV_1A_BASE64 | The algorithm to be used for constructing the prefix or infix of the blob store path. This setting is applied if `cluster.remote_store.routing_table.path_type` is `hashed_prefix` or `hashed_infix`. Valid algorithm values are `FNV_1A_BASE64` and `FNV_1A_COMPOSITE_1`. +Setting | Default | Description +:--- |:---| :--- +`cluster.remote_store.state.read_timeout` | 20s | The amount of time to wait for the remote state download to complete on the follower node. +`cluster.remote_store.state.path.prefix` | "" (Empty string) | The fixed prefix to add to the index metadata files in the blob store. +`cluster.remote_store.index_metadata.path_type` | `HASHED_PREFIX` | The path type used for creating an index metadata path in the blob store. Valid values are `FIXED`, `HASHED_PREFIX`, and `HASHED_INFIX`. +`cluster.remote_store.index_metadata.path_hash_algo` | `FNV_1A_BASE64 ` | The algorithm that constructs the prefix or infix for the index metadata path in the blob store. This setting is applied if the ``cluster.remote_store.index_metadata.path_type` setting is `HASHED_PREFIX` or `HASHED_INFIX`. Valid algorithm values are `FNV_1A_BASE64` and `FNV_1A_COMPOSITE_1`. +`cluster.remote_store.routing_table.path.prefix` | "" (Empty string) | The fixed prefix to add for the index routing files in the blob store. + diff --git a/_tuning-your-cluster/availability-and-recovery/remote-store/snapshot-interoperability.md b/_tuning-your-cluster/availability-and-recovery/remote-store/snapshot-interoperability.md index 0415af65f12..e93f504be38 100644 --- a/_tuning-your-cluster/availability-and-recovery/remote-store/snapshot-interoperability.md +++ b/_tuning-your-cluster/availability-and-recovery/remote-store/snapshot-interoperability.md @@ -27,7 +27,7 @@ PUT /_snapshot/snap_repo ``` {% include copy-curl.html %} -Once enabled, all requests using the [Snapshot API]({{site.url}}{{site.baseurl}}/api-reference/snapshots/index/) will remain the same for all snapshots. After the setting is enabled, we recommend not disabling the setting. Doing so could affect data durability. +Once enabled, all requests using the [Snapshot API]({{site.url}}{{site.baseurl}}/api-reference/snapshots/index/) will remain the same for all snapshots. Therefore, do not disable the shallow snapshot setting after it has been enabled because disabling the setting could affect data durability. ## Considerations @@ -37,3 +37,43 @@ Consider the following before using shallow copy snapshots: - All nodes in the cluster must use OpenSearch 2.10 or later to take advantage of shallow copy snapshots. - The `incremental` file count and size between the current snapshot and the last snapshot is `0` when using shallow copy snapshots. - Searchable snapshots are not supported inside shallow copy snapshots. + +## Shallow snapshot v2 + +Starting with OpenSearch 2.17, the shallow snapshot feature offers an improved version called `shallow snapshot v2`, which aims to makes snapshot operations more efficient and scalable by introducing the following enhancements: + +* Deterministic snapshot operations: Shallow snapshot v2 makes snapshot operations more deterministic, ensuring consistent and predictable behavior. +* Minimized cluster state updates: Shallow snapshot v2 minimizes the number of cluster state updates required during snapshot operations, reducing overhead and improving performance. +* Scalability: Shallow snapshot v2 allows snapshot operations to scale independently of the number of shards in the cluster, enabling better performance and efficiency for large datasets. + +Shallow snapshot v2 must be enabled separately from shallow copies. + +### Enabling shallow snapshot v2 + +To enable shallow snapshot v2, enable the following repository settings: + +- `remote_store_index_shallow_copy: true` +- `shallow_snapshot_v2: true` + +The following example request creates a shallow snapshot v2 repository: + +```bash +PUT /_snapshot/snap_repo +{ +"type": "s3", +"settings": { +"bucket": "test-bucket", +"base_path": "daily-snaps", +"remote_store_index_shallow_copy": true, +"shallow_snapshot_v2": true +} +} +``` +{% include copy-curl.html %} + +### Limitations + +Shallow snapshot v2 has the following limitations: + +* Shallow snapshot v2 only supported for remote-backed indexes. +* All nodes in the cluster must use OpenSearch 2.17 or later to take advantage of shallow snapshot v2. diff --git a/_tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot.md b/_tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot.md index 4af25004a75..d13955f3f06 100644 --- a/_tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot.md +++ b/_tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot.md @@ -18,7 +18,7 @@ The searchable snapshot feature incorporates techniques like caching frequently To configure the searchable snapshots feature, create a node in your `opensearch.yml file` and define the node role as `search`. Optionally, you can also configure the `cache.size` property for the node. -A `search` node reserves storage for the cache to perform searchable snapshot queries. In the case of a dedicated search node where the node exclusively has the `search` role, this value defaults to a fixed percentage of available storage. In other cases, the value needs to be configured by the user using the `node.search.cache.size` setting. +A `search` node reserves storage for the cache to perform searchable snapshot queries. In the case of a dedicated search node where the node exclusively has the `search` role, this value defaults to a fixed percentage (80%) of available storage. In other cases, the value needs to be configured by the user using the `node.search.cache.size` setting. Parameter | Type | Description :--- | :--- | :--- @@ -108,4 +108,5 @@ The following are known limitations of the searchable snapshots feature: - Many remote object stores charge on a per-request basis for retrieval, so users should closely monitor any costs incurred. - Searching remote data can impact the performance of other queries running on the same node. We recommend that users provision dedicated nodes with the `search` role for performance-critical applications. - For better search performance, consider [force merging]({{site.url}}{{site.baseurl}}/api-reference/index-apis/force-merge/) indexes into a smaller number of segments before taking a snapshot. For the best performance, at the cost of using compute resources prior to snapshotting, force merge your index into one segment. -- We recommend configuring a maximum ratio of remote data to local disk cache size using the `cluster.filecache.remote_data_ratio` setting. A ratio of 5 is a good starting point for most workloads to ensure good query performance. If the ratio is too large, then there may not be sufficient disk space to handle the search workload. For more details on the maximum ratio of remote data, see issue [#11676](https://github.com/opensearch-project/OpenSearch/issues/11676). +- We recommend configuring a maximum ratio of remote data to local disk cache size using the `cluster.filecache.remote_data_ratio` setting. A ratio of 5 is a good starting point for most workloads to ensure good query performance. If the ratio is too large, then there may not be sufficient disk space to handle the search workload. For more details on the maximum ratio of remote data, see issue [#11676](https://github.com/opensearch-project/OpenSearch/issues/11676). +- k-NN native-engine-based indexes using `faiss` and `nmslib` engines are incompatible with searchable snapshots. diff --git a/_tuning-your-cluster/index.md b/_tuning-your-cluster/index.md index 99db78565fc..fa0973395fc 100644 --- a/_tuning-your-cluster/index.md +++ b/_tuning-your-cluster/index.md @@ -192,11 +192,27 @@ To better understand and monitor your cluster, use the [CAT API]({{site.url}}{{s ## (Advanced) Step 6: Configure shard allocation awareness or forced awareness +To further fine-tune your shard allocation, you can set custom node attributes for shard allocation awareness or forced awareness. + ### Shard allocation awareness -If your nodes are spread across several geographical zones, you can configure shard allocation awareness to allocate all replica shards to a zone that’s different from their primary shard. +You can set custom node attributes on OpenSearch nodes to be used for shard allocation awareness. For example, you can set the `zone` attribute on each node to represent the zone in which the node is located. You can also use the `zone` attribute to ensure that the primary shard and its replica shards are allocated in a balanced manner across available, distinct zones. In this scenario, maximum shard copies per zone would equal `ceil (number_of_shard_copies/number_of_distinct_zones)`. + +OpenSearch, by default, allocates shard copies of a single shard across different nodes. When only 1 zone is available, such as after a zone failure, OpenSearch allocates replica shards to the only remaining zone---it considers only available zones (attribute values) when calculating the maximum number of allowed shard copies per zone. + +For example, if your index has a total of 5 shard copies (1 primary and 4 replicas) and nodes in 3 distinct zones, then OpenSearch will perform the following to allocate all 5 shard copies: + +- Allocate no more than 2 shards per zone, which will require at least 2 nodes in 2 zones. +- Allocate the last shard in the third zone, with at least 1 node needed in the third zone. -With shard allocation awareness, if the nodes in one of your zones fail, you can be assured that your replica shards are spread across your other zones. It adds a layer of fault tolerance to ensure your data survives a zone failure beyond just individual node failures. +Alternatively, if you have 3 nodes in the first zone and 1 node in each remaining zone, then OpenSearch will allocate: + +- 2 shard copies in the first zone. +- 1 shard copy in the remaining 2 zones. + +The final shard copy will remain unallocated due to the lack of nodes. + +With shard allocation awareness, if the nodes in one of your zones fail, you can be assured that your replica shards are spread across your other zones, adding a layer of fault tolerance to ensure that your data survives zone failures. To configure shard allocation awareness, add zone attributes to `opensearch-d1` and `opensearch-d2`, respectively: @@ -219,6 +235,8 @@ PUT _cluster/settings } ``` +You can also use multiple attributes for shard allocation awareness by providing the attributes as a comma-separated string, for example, `zone,rack`. + You can either use `persistent` or `transient` settings. We recommend the `persistent` setting because it persists through a cluster reboot. Transient settings don't persist through a cluster reboot. Shard allocation awareness attempts to separate primary and replica shards across multiple zones. However, if only one zone is available (such as after a zone failure), OpenSearch allocates replica shards to the only remaining zone. diff --git a/assets/js/search.js b/assets/js/search.js index 8d9cab2ec56..4d4fce62f3a 100644 --- a/assets/js/search.js +++ b/assets/js/search.js @@ -319,7 +319,7 @@ window.doResultsPageSearch = async (query, type, version) => { searchResultsContainer.appendChild(resultElement); const breakline = document.createElement('hr'); - breakline.style.border = '.5px solid #ccc'; + breakline.style.borderTop = '.5px solid #ccc'; breakline.style.margin = 'auto'; searchResultsContainer.appendChild(breakline); }); diff --git a/build.sh b/build.sh index 060bbfa6666..85ef617931a 100755 --- a/build.sh +++ b/build.sh @@ -1,3 +1,9 @@ #!/usr/bin/env bash -JEKYLL_LINK_CHECKER=internal bundle exec jekyll serve --host localhost --port 4000 --incremental --livereload --open-url --trace +host="localhost" + +if [[ "$DOCKER_BUILD" == "true" ]]; then + host="0.0.0.0" +fi + +JEKYLL_LINK_CHECKER=internal bundle exec jekyll serve --host ${host} --port 4000 --incremental --livereload --open-url --trace diff --git a/docker-compose.dev.yml b/docker-compose.dev.yml new file mode 100644 index 00000000000..04dd007db96 --- /dev/null +++ b/docker-compose.dev.yml @@ -0,0 +1,14 @@ +version: "3" + +services: + doc_builder: + image: ruby:3.2.4 + volumes: + - .:/app + working_dir: /app + ports: + - "4000:4000" + command: bash -c "bundler install && bash build.sh" + environment: + BUNDLE_PATH: /app/vendor/bundle # Avoid installing gems globally. + DOCKER_BUILD: true # Signify build.sh to bind to 0.0.0.0 for effective doc access from host. diff --git a/images/dashboards/data-source-UI.png b/images/dashboards/data-source-UI.png deleted file mode 100644 index bc072378476..00000000000 Binary files a/images/dashboards/data-source-UI.png and /dev/null differ diff --git a/images/dashboards/delete-data-source.png b/images/dashboards/delete-data-source.png deleted file mode 100644 index 2d0337a92b8..00000000000 Binary files a/images/dashboards/delete-data-source.png and /dev/null differ diff --git a/release-notes/opensearch-documentation-release-notes-2.17.0.md b/release-notes/opensearch-documentation-release-notes-2.17.0.md new file mode 100644 index 00000000000..d9ed51737cb --- /dev/null +++ b/release-notes/opensearch-documentation-release-notes-2.17.0.md @@ -0,0 +1,36 @@ +# OpenSearch Documentation Website 2.17.0 Release Notes + +The OpenSearch 2.17.0 documentation includes the following additions and updates. + +## New documentation for 2.17.0 + +- Get offline batch inference details using task API in m [#8305](https://github.com/opensearch-project/documentation-website/pull/8305) +- Documentation for Binary Quantization Support with KNN Vector Search [#8281](https://github.com/opensearch-project/documentation-website/pull/8281) +- add offline batch ingestion tech doc [#8251](https://github.com/opensearch-project/documentation-website/pull/8251) +- Add documentation changes for disk-based k-NN [#8246](https://github.com/opensearch-project/documentation-website/pull/8246) +- Derived field updates for 2.17 [#8244](https://github.com/opensearch-project/documentation-website/pull/8244) +- Add changes for multiple signing keys [#8243](https://github.com/opensearch-project/documentation-website/pull/8243) +- Add documentation changes for Snapshot Status API [#8235](https://github.com/opensearch-project/documentation-website/pull/8235) +- Update flow framework additional fields in previous_node_inputs [#8233](https://github.com/opensearch-project/documentation-website/pull/8233) +- Add documentation changes for shallow snapshot v2 [#8207](https://github.com/opensearch-project/documentation-website/pull/8207) +- Add documentation for context and ABC templates [#8197](https://github.com/opensearch-project/documentation-website/pull/8197) +- Create documentation for snapshots with hashed prefix path type [#8196](https://github.com/opensearch-project/documentation-website/pull/8196) +- Adding documentation for remote index use in AD [#8191](https://github.com/opensearch-project/documentation-website/pull/8191) +- Doc update for concurrent search [#8181](https://github.com/opensearch-project/documentation-website/pull/8181) +- Adding new cluster search setting docs [#8180](https://github.com/opensearch-project/documentation-website/pull/8180) +- Add new settings for remote publication [#8176](https://github.com/opensearch-project/documentation-website/pull/8176) +- Grouping Top N queries documentation [#8173](https://github.com/opensearch-project/documentation-website/pull/8173) +- Document reprovision param for Update Workflow API [#8172](https://github.com/opensearch-project/documentation-website/pull/8172) +- Add documentation for Faiss byte vector [#8170](https://github.com/opensearch-project/documentation-website/pull/8170) +- Terms query can accept encoded terms input as bitmap [#8133](https://github.com/opensearch-project/documentation-website/pull/8133) +- Update doc for adding new param in cat shards action for cancellation… [#8127](https://github.com/opensearch-project/documentation-website/pull/8127) +- Add docs on skip_validating_missing_parameters in ml-commons connector [#8118](https://github.com/opensearch-project/documentation-website/pull/8118) +- Add Split Response Processor to 2.17 Search Pipeline docs [#8081](https://github.com/opensearch-project/documentation-website/pull/8081) +- Added documentation for FGAC for Flow Framework [#8076](https://github.com/opensearch-project/documentation-website/pull/8076) +- Remove composite agg limitations for concurrent search [#7904](https://github.com/opensearch-project/documentation-website/pull/7904) +- Add doc for nodes stats search.request.took fields [#7887](https://github.com/opensearch-project/documentation-website/pull/7887) +- Add documentation for ignore_hosts config option for ip-based rate limiting [#7859](https://github.com/opensearch-project/documentation-website/pull/7859) + +## Documentation for 2.17.0 experimental features + +- Document new experimental ingestion streaming APIs [#8123](https://github.com/opensearch-project/documentation-website/pull/8123)