diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
index 003b275782..14abeab567 100644
--- a/_analyzers/token-filters/index.md
+++ b/_analyzers/token-filters/index.md
@@ -31,7 +31,7 @@ Token filter | Underlying Lucene token filter| Description
[`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
[`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries.
-[`hyphenation_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hyphenation-decompounder) | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
+[`hyphenation_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hyphenation-decompounder/) | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
[`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
[`keep_words`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-words/) | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
[`keyword_marker`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keyword-marker/) | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
@@ -41,24 +41,24 @@ Token filter | Underlying Lucene token filter| Description
`length` | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`.
`limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count.
`lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)).
-`min_hash` | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially: <br> 1. Hashes each token in the stream. <br> 2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket. <br> 3. Outputs the smallest hash from each bucket as a token stream.
-`multiplexer` | N/A | Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens.
-`ngram` | [NGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`.
-Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ar/ArabicNormalizer.html) <br> `german_normalization`: [GermanNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html) <br> `hindi_normalization`: [HindiNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hi/HindiNormalizer.html) <br> `indic_normalization`: [IndicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/in/IndicNormalizer.html) <br> `sorani_normalization`: [SoraniNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html) <br> `persian_normalization`: [PersianNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/fa/PersianNormalizer.html) <br> `scandinavian_normalization` : [ScandinavianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html) <br> `scandinavian_folding`: [ScandinavianFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html) <br> `serbian_normalization`: [SerbianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/sr/SerbianNormalizationFilter.html) | Normalizes the characters of one of the listed languages.
-`pattern_capture` | N/A | Generates a token for every capture group in the provided regular expression. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).
-`pattern_replace` | N/A | Matches a pattern in the provided regular expression and replaces matching substrings. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).
-`phonetic` | N/A | Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the `analysis-phonetic` plugin.
-`porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language.
-`predicate_token_filter` | N/A | Removes tokens that don’t match the specified predicate script. Supports inline Painless scripts only.
-`remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position.
+[`min_hash`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/min-hash/) | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially: <br> 1. Hashes each token in the stream. <br> 2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket. <br> 3. Outputs the smallest hash from each bucket as a token stream.
+[`multiplexer`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/multiplexer/) | N/A | Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens.
+[`ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/ngram/) | [NGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`.
+[Normalization]({{site.url}}{{site.baseurl}}/analyzers/token-filters/normalization/) | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ar/ArabicNormalizer.html) <br> `german_normalization`: [GermanNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html) <br> `hindi_normalization`: [HindiNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hi/HindiNormalizer.html) <br> `indic_normalization`: [IndicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/in/IndicNormalizer.html) <br> `sorani_normalization`: [SoraniNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html) <br> `persian_normalization`: [PersianNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/fa/PersianNormalizer.html) <br> `scandinavian_normalization` : [ScandinavianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html) <br> `scandinavian_folding`: [ScandinavianFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html) <br> `serbian_normalization`: [SerbianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/sr/SerbianNormalizationFilter.html) | Normalizes the characters of one of the listed languages.
+[`pattern_capture`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/pattern-capture/) | N/A | Generates a token for every capture group in the provided regular expression. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).
+[`pattern_replace`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/pattern-replace/) | N/A | Matches a pattern in the provided regular expression and replaces matching substrings. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).
+[`phonetic`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/phonetic/) | N/A | Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the `analysis-phonetic` plugin.
+[`porter_stem`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/porter-stem/) | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language.
+[`predicate_token_filter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/predicate-token-filter/) | N/A | Removes tokens that do not match the specified predicate script. Supports only inline Painless scripts.
+[`remove_duplicates`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/remove-duplicates/) | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position.
`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
`snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
-`synonym` | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
-`synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process.
+[`synonym`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym/) | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
+[`synonym_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym-graph/) | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process.
`trim` | [TrimFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing white space from each token in a stream.
`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit.
`unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream.
diff --git a/_analyzers/token-filters/min-hash.md b/_analyzers/token-filters/min-hash.md
new file mode 100644
index 0000000000..e4f1a8da91
--- /dev/null
+++ b/_analyzers/token-filters/min-hash.md
@@ -0,0 +1,138 @@
+---
+layout: default
+title: Min hash
+parent: Token filters
+nav_order: 270
+---
+
+# Min hash token filter
+
+The `min_hash` token filter generates hashes for a set of tokens (typically, the tokens produced for an analyzed field) using the [MinHash](https://en.wikipedia.org/wiki/MinHash) approximation algorithm, which is useful for detecting similarity between documents.
+
+## Parameters
+
+The `min_hash` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`hash_count` | Optional | Integer | The number of hash values to generate for each token. Increasing this value generally improves the accuracy of similarity estimation but increases the computational cost. Default is `1`.
+`bucket_count` | Optional | Integer | The number of hash buckets to use. This affects the granularity of the hashing. A larger number of buckets provides finer granularity and reduces hash collisions but requires more memory. Default is `512`.
+`hash_set_size` | Optional | Integer | The number of hashes to retain in each bucket. This can influence the hashing quality. Larger set sizes may lead to better similarity detection but consume more memory. Default is `1`.
+`with_rotation` | Optional | Boolean | When set to `true`, the filter populates empty buckets with the value from the first non-empty bucket found to its circular right, provided that the `hash_set_size` is `1`. If the `bucket_count` argument exceeds `1`, this setting automatically defaults to `true`; otherwise, it defaults to `false`.
+
+## Example
+
+The following example request creates a new index named `minhash_index` and configures an analyzer with a `min_hash` filter:
+
+```json
+PUT /minhash_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "minhash_filter": {
+ "type": "min_hash",
+ "hash_count": 3,
+ "bucket_count": 512,
+ "hash_set_size": 1,
+ "with_rotation": false
+ }
+ },
+ "analyzer": {
+ "minhash_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "minhash_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /minhash_index/_analyze
+{
+ "analyzer": "minhash_analyzer",
+ "text": "OpenSearch is very powerful."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens (the tokens are not human readable because they represent hashes):
+
+```json
+{
+ "tokens" : [
+ {
+ "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
+ "start_offset" : 0,
+ "end_offset" : 27,
+ "type" : "MIN_HASH",
+ "position" : 0
+ },
+ {
+ "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
+ "start_offset" : 0,
+ "end_offset" : 27,
+ "type" : "MIN_HASH",
+ "position" : 0
+ },
+ ...
+```
+
+To demonstrate the usefulness of the `min_hash` token filter, you can use the following Python script to compare two similar strings using the previously created analyzer:
+
+```python
+from opensearchpy import OpenSearch
+from requests.auth import HTTPBasicAuth
+
+# Initialize the OpenSearch client with authentication
+host = 'https://localhost:9200' # Update if using a different host/port
+auth = ('admin', 'admin') # Username and password
+
+# Create the OpenSearch client with SSL verification turned off
+client = OpenSearch(
+ hosts=[host],
+ http_auth=auth,
+ use_ssl=True,
+ verify_certs=False, # Disable SSL certificate validation
+ ssl_show_warn=False # Suppress SSL warnings in the output
+)
+
+# Analyzes text and returns the minhash tokens
+def analyze_text(index, text):
+ response = client.indices.analyze(
+ index=index,
+ body={
+ "analyzer": "minhash_analyzer",
+ "text": text
+ }
+ )
+ return [token['token'] for token in response['tokens']]
+
+# Analyze two similar texts
+tokens_1 = analyze_text('minhash_index', 'OpenSearch is a powerful search engine.')
+tokens_2 = analyze_text('minhash_index', 'OpenSearch is a very powerful search engine.')
+
+# Calculate Jaccard similarity
+set_1 = set(tokens_1)
+set_2 = set(tokens_2)
+shared_tokens = set_1.intersection(set_2)
+jaccard_similarity = len(shared_tokens) / len(set_1.union(set_2))
+
+print(f"Jaccard Similarity: {jaccard_similarity}")
+```
+
+The script outputs the Jaccard similarity score:
+
+```yaml
+Jaccard Similarity: 0.8571428571428571
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/multiplexer.md b/_analyzers/token-filters/multiplexer.md
new file mode 100644
index 0000000000..21597b7fc1
--- /dev/null
+++ b/_analyzers/token-filters/multiplexer.md
@@ -0,0 +1,165 @@
+---
+layout: default
+title: Multiplexer
+parent: Token filters
+nav_order: 280
+---
+
+# Multiplexer token filter
+
+The `multiplexer` token filter allows you to create multiple versions of the same token by applying different filters. This is useful when you want to analyze the same token in multiple ways. For example, you may want to analyze a token using different stemming, synonyms, or n-gram filters and use all of the generated tokens together. This token filter works by duplicating the token stream and applying different filters to each copy.
+
+The `multiplexer` token filter removes duplicate tokens from the token stream.
+{: .important}
+
+The `multiplexer` token filter does not support multiword `synonym` or `synonym_graph` token filters or `shingle` token filters because they need to analyze not only the current token but also upcoming tokens in order to determine how to transform the input correctly.
+{: .important}
+
+## Parameters
+
+The `multiplexer` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`filters` | Optional | List of strings | A comma-separated list of token filters to apply to each copy of the token stream. Default is an empty list.
+`preserve_original` | Optional | Boolean | Whether to keep the original token as one of the outputs. Default is `true`.
+
+## Example
+
+The following example request creates a new index named `multiplexer_index` and configures an analyzer with a `multiplexer` filter:
+
+```json
+PUT /multiplexer_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "english_stemmer": {
+ "type": "stemmer",
+ "name": "english"
+ },
+ "synonym_filter": {
+ "type": "synonym",
+ "synonyms": [
+ "quick,fast"
+ ]
+ },
+ "multiplexer_filter": {
+ "type": "multiplexer",
+ "filters": ["english_stemmer", "synonym_filter"],
+ "preserve_original": true
+ }
+ },
+ "analyzer": {
+ "multiplexer_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "multiplexer_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /multiplexer_index/_analyze
+{
+ "analyzer": "multiplexer_analyzer",
+ "text": "The slow turtle hides from the quick dog"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "The",
+ "start_offset": 0,
+ "end_offset": 3,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "slow",
+ "start_offset": 4,
+ "end_offset": 8,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "turtle",
+ "start_offset": 9,
+ "end_offset": 15,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "turtl",
+ "start_offset": 9,
+ "end_offset": 15,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "hides",
+ "start_offset": 16,
+ "end_offset": 21,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "hide",
+ "start_offset": 16,
+ "end_offset": 21,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "from",
+ "start_offset": 22,
+ "end_offset": 26,
+ "type": "",
+ "position": 4
+ },
+ {
+ "token": "the",
+ "start_offset": 27,
+ "end_offset": 30,
+ "type": "",
+ "position": 5
+ },
+ {
+ "token": "quick",
+ "start_offset": 31,
+ "end_offset": 36,
+ "type": "",
+ "position": 6
+ },
+ {
+ "token": "fast",
+ "start_offset": 31,
+ "end_offset": 36,
+ "type": "SYNONYM",
+ "position": 6
+ },
+ {
+ "token": "dog",
+ "start_offset": 37,
+ "end_offset": 40,
+ "type": "",
+ "position": 7
+ }
+ ]
+}
+```
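+
+The preceding example keeps the original token because `preserve_original` is set to `true`. If you only need the filtered variants, you can set the parameter to `false`. The following is a minimal sketch of such a configuration, reusing the `english_stemmer` and `synonym_filter` filters from the preceding example (the index and filter names are illustrative):
+
+```json
+PUT /multiplexer_no_original_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "english_stemmer": {
+          "type": "stemmer",
+          "name": "english"
+        },
+        "synonym_filter": {
+          "type": "synonym",
+          "synonyms": [
+            "quick,fast"
+          ]
+        },
+        "multiplexer_filter_no_original": {
+          "type": "multiplexer",
+          "filters": ["english_stemmer", "synonym_filter"],
+          "preserve_original": false
+        }
+      },
+      "analyzer": {
+        "multiplexer_analyzer_no_original": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "multiplexer_filter_no_original"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Analyzing the same sentence with this analyzer should return only the tokens produced by the stemmer and synonym filters, without a separate copy of the original token.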
diff --git a/_analyzers/token-filters/ngram.md b/_analyzers/token-filters/ngram.md
new file mode 100644
index 0000000000..c029eac26e
--- /dev/null
+++ b/_analyzers/token-filters/ngram.md
@@ -0,0 +1,137 @@
+---
+layout: default
+title: N-gram
+parent: Token filters
+nav_order: 290
+---
+
+# N-gram token filter
+
+The `ngram` token filter is a powerful tool used to break down text into smaller components, known as _n-grams_, which can improve partial matching and fuzzy search capabilities. It works by splitting a token into smaller substrings of defined lengths. These filters are commonly used in search applications to support autocomplete, partial matches, and typo-tolerant search. For more information, see [Autocomplete functionality]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/autocomplete/) and [Did-you-mean]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/did-you-mean/).
+
+## Parameters
+
+The `ngram` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`min_gram` | Optional | Integer | The minimum length of the n-grams. Default is `1`.
+`max_gram` | Optional | Integer | The maximum length of the n-grams. Default is `2`.
+`preserve_original` | Optional | Boolean | Whether to keep the original token as one of the outputs. Default is `false`.
+
+## Example
+
+The following example request creates a new index named `ngram_example_index` and configures an analyzer with an `ngram` filter:
+
+```json
+PUT /ngram_example_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "ngram_filter": {
+ "type": "ngram",
+ "min_gram": 2,
+ "max_gram": 3
+ }
+ },
+ "analyzer": {
+ "ngram_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "ngram_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /ngram_example_index/_analyze
+{
+ "analyzer": "ngram_analyzer",
+ "text": "Search"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "se",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "sea",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "ea",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "ear",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "ar",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "arc",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "rc",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "rch",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "ch",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ }
+ ]
+}
+```
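+
+The preceding example does not set `preserve_original`, so only the n-grams are emitted. If you also want the complete token in the output, you can enable that parameter. The following is a minimal sketch (the index and filter names are illustrative):
+
+```json
+PUT /ngram_preserve_example_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "ngram_filter_with_original": {
+          "type": "ngram",
+          "min_gram": 2,
+          "max_gram": 3,
+          "preserve_original": true
+        }
+      },
+      "analyzer": {
+        "ngram_analyzer_with_original": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "ngram_filter_with_original"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Analyzing the text `Search` with this analyzer should return the same n-grams as in the previous example plus the full token `search`.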
diff --git a/_analyzers/token-filters/normalization.md b/_analyzers/token-filters/normalization.md
new file mode 100644
index 0000000000..1be08e65c2
--- /dev/null
+++ b/_analyzers/token-filters/normalization.md
@@ -0,0 +1,88 @@
+---
+layout: default
+title: Normalization
+parent: Token filters
+nav_order: 300
+---
+
+# Normalization token filter
+
+The `normalization` token filter is designed to adjust and simplify text in a way that reduces variations, particularly variations in special characters. It is primarily used to handle variations in writing by standardizing characters in specific languages.
+
+The following `normalization` token filters are available:
+
+- [arabic_normalization](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizer.html)
+- [german_normalization](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html)
+- [hindi_normalization](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizer.html)
+- [indic_normalization](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizer.html)
+- [sorani_normalization](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html)
+- [persian_normalization](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizer.html)
+- [scandinavian_normalization](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html)
+- [scandinavian_folding](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html)
+- [serbian_normalization](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/sr/SerbianNormalizationFilter.html)
+
+
+## Example
+
+The following example request creates a new index named `german_normalizer_example` and configures an analyzer with a `german_normalization` filter:
+
+```json
+PUT /german_normalizer_example
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "german_normalizer": {
+ "type": "german_normalization"
+ }
+ },
+ "analyzer": {
+ "german_normalizer_analyzer": {
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "german_normalizer"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /german_normalizer_example/_analyze
+{
+ "text": "Straße München",
+ "analyzer": "german_normalizer_analyzer"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "strasse",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "munchen",
+ "start_offset": 7,
+ "end_offset": 14,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/pattern-capture.md b/_analyzers/token-filters/pattern-capture.md
new file mode 100644
index 0000000000..cff36b583d
--- /dev/null
+++ b/_analyzers/token-filters/pattern-capture.md
@@ -0,0 +1,97 @@
+---
+layout: default
+title: Pattern capture
+parent: Token filters
+nav_order: 310
+---
+
+# Pattern capture token filter
+
+The `pattern_capture` token filter is a powerful filter that uses regular expressions to capture and extract parts of text according to specific patterns. This filter can be useful when you want to extract particular parts of tokens, such as email domains, hashtags, or numbers, and reuse them for further analysis or indexing.
+
+## Parameters
+
+The `pattern_capture` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`patterns` | Required | Array of strings | An array of regular expressions used to capture parts of text.
+`preserve_original` | Optional | Boolean| Whether to keep the original token in the output. Default is `true`.
+
+
+## Example
+
+The following example request creates a new index named `email_index` and configures an analyzer with a `pattern_capture` filter to extract the local part and domain name from an email address:
+
+```json
+PUT /email_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "email_pattern_capture": {
+ "type": "pattern_capture",
+ "preserve_original": true,
+ "patterns": [
+ "^([^@]+)",
+ "@(.+)$"
+ ]
+ }
+ },
+ "analyzer": {
+ "email_analyzer": {
+ "tokenizer": "uax_url_email",
+ "filter": [
+ "email_pattern_capture",
+ "lowercase"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /email_index/_analyze
+{
+ "text": "john.doe@example.com",
+ "analyzer": "email_analyzer"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "john.doe@example.com",
+ "start_offset": 0,
+ "end_offset": 20,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "john.doe",
+ "start_offset": 0,
+ "end_offset": 20,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "example.com",
+ "start_offset": 0,
+ "end_offset": 20,
+ "type": "",
+ "position": 0
+ }
+ ]
+}
+```
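+
+The hashtag use case mentioned previously works the same way. The following is a minimal sketch that captures the text of each hashtag while keeping the original token (the index name, filter name, and pattern are illustrative):
+
+```json
+PUT /hashtag_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "hashtag_pattern_capture": {
+          "type": "pattern_capture",
+          "preserve_original": true,
+          "patterns": [
+            "#(\\w+)"
+          ]
+        }
+      },
+      "analyzer": {
+        "hashtag_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [
+            "hashtag_pattern_capture",
+            "lowercase"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Analyzing a token such as `#OpenSearch` with this analyzer should emit both `#opensearch` and the captured group `opensearch` in the same position.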
diff --git a/_analyzers/token-filters/pattern-replace.md b/_analyzers/token-filters/pattern-replace.md
new file mode 100644
index 0000000000..73ef7fa7d8
--- /dev/null
+++ b/_analyzers/token-filters/pattern-replace.md
@@ -0,0 +1,116 @@
+---
+layout: default
+title: Pattern replace
+parent: Token filters
+nav_order: 320
+---
+
+# Pattern replace token filter
+
+The `pattern_replace` token filter allows you to modify tokens using regular expressions. This filter replaces patterns in tokens with the specified values, giving you flexibility in transforming or normalizing tokens before indexing them. It's particularly useful when you need to clean or standardize text during analysis.
+
+## Parameters
+
+The `pattern_replace` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`pattern` | Required | String | A regular expression pattern that matches the text that needs to be replaced.
+`all` | Optional | Boolean | Whether to replace all pattern matches. If `false`, only the first match is replaced. Default is `true`.
+`replacement` | Optional | String | A string with which to replace the matched pattern. Default is an empty string.
+
+
+## Example
+
+The following example request creates a new index named `text_index` and configures an analyzer with a `pattern_replace` filter to replace tokens containing digits with the string `[NUM]`:
+
+```json
+PUT /text_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "number_replace_filter": {
+ "type": "pattern_replace",
+ "pattern": "\\d+",
+ "replacement": "[NUM]"
+ }
+ },
+ "analyzer": {
+ "number_analyzer": {
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "number_replace_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /text_index/_analyze
+{
+ "text": "Visit us at 98765 Example St.",
+ "analyzer": "number_analyzer"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "visit",
+ "start_offset": 0,
+ "end_offset": 5,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "us",
+ "start_offset": 6,
+ "end_offset": 8,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "at",
+ "start_offset": 9,
+ "end_offset": 11,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "[NUM]",
+ "start_offset": 12,
+ "end_offset": 17,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "example",
+ "start_offset": 18,
+ "end_offset": 25,
+ "type": "",
+ "position": 4
+ },
+ {
+ "token": "st",
+ "start_offset": 26,
+ "end_offset": 28,
+ "type": "",
+ "position": 5
+ }
+ ]
+}
+```
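+
+The preceding example uses the default `all` setting, which replaces every match within a token. The following is a minimal sketch that sets `all` to `false` so that only the first match in each token is replaced (the index and filter names are illustrative):
+
+```json
+PUT /first_match_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "first_digit_replace_filter": {
+          "type": "pattern_replace",
+          "pattern": "\\d",
+          "replacement": "#",
+          "all": false
+        }
+      },
+      "analyzer": {
+        "first_digit_analyzer": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "first_digit_replace_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+With this configuration, a token such as `a1b2` should become `a#b2` because only the first digit is replaced.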
diff --git a/_analyzers/token-filters/phonetic.md b/_analyzers/token-filters/phonetic.md
new file mode 100644
index 0000000000..7fe380851f
--- /dev/null
+++ b/_analyzers/token-filters/phonetic.md
@@ -0,0 +1,98 @@
+---
+layout: default
+title: Phonetic
+parent: Token filters
+nav_order: 330
+---
+
+# Phonetic token filter
+
+The `phonetic` token filter transforms tokens into their phonetic representations, enabling more flexible matching of words that sound similar but are spelled differently. This is particularly useful for searching names, brands, or other entities that users might spell differently but pronounce similarly.
+
+The `phonetic` token filter is not included in OpenSearch distributions by default. To use this token filter, you must first install the `analysis-phonetic` plugin as follows and then restart OpenSearch:
+
+```bash
+./bin/opensearch-plugin install analysis-phonetic
+```
+{% include copy.html %}
+
+For more information about installing plugins, see [Installing plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/).
+{: .note}
+
+## Parameters
+
+The `phonetic` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`encoder` | Optional | String | Specifies the phonetic algorithm to use. Valid values are:<br>- `metaphone` (default)<br>- `double_metaphone`<br>- `soundex`<br>- `refined_soundex`<br>- `caverphone1`<br>- `caverphone2`<br>- `cologne`<br>- `nysiis`<br>- `koelnerphonetik`<br>- `haasephonetik`<br>- `beider_morse`<br>- `daitch_mokotoff`
+`replace` | Optional | Boolean | Whether to replace the original token. If `false`, the original token is included in the output along with the phonetic encoding. Default is `true`.
+
+
+## Example
+
+The following example request creates a new index named `names_index` and configures an analyzer with a `phonetic` filter:
+
+```json
+PUT /names_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_phonetic_filter": {
+ "type": "phonetic",
+ "encoder": "double_metaphone",
+ "replace": true
+ }
+ },
+ "analyzer": {
+ "phonetic_analyzer": {
+ "tokenizer": "standard",
+ "filter": [
+ "my_phonetic_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated for the names `Stephen` and `Steven` using the analyzer:
+
+```json
+POST /names_index/_analyze
+{
+ "text": "Stephen",
+ "analyzer": "phonetic_analyzer"
+}
+```
+{% include copy-curl.html %}
+
+```json
+POST /names_index/_analyze
+{
+ "text": "Steven",
+ "analyzer": "phonetic_analyzer"
+}
+```
+{% include copy-curl.html %}
+
+In both cases, the response contains the same phonetic token, `STFN` (only the offsets differ based on the length of the input text):
+
+```json
+{
+ "tokens": [
+ {
+ "token": "STFN",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ }
+ ]
+}
+```
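+
+To match both the exact spelling and the phonetic form, you can set `replace` to `false` so that the original token is emitted together with its encoding. The following is a minimal sketch (the index and filter names are illustrative):
+
+```json
+PUT /names_index_keep_original
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_phonetic_filter_keep_original": {
+          "type": "phonetic",
+          "encoder": "double_metaphone",
+          "replace": false
+        }
+      },
+      "analyzer": {
+        "phonetic_analyzer_keep_original": {
+          "tokenizer": "standard",
+          "filter": [
+            "my_phonetic_filter_keep_original"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Analyzing `Stephen` with this analyzer should return both the original token `Stephen` and its phonetic encoding `STFN` in the same position.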
diff --git a/_analyzers/token-filters/porter-stem.md b/_analyzers/token-filters/porter-stem.md
new file mode 100644
index 0000000000..fa2f4208a7
--- /dev/null
+++ b/_analyzers/token-filters/porter-stem.md
@@ -0,0 +1,83 @@
+---
+layout: default
+title: Porter stem
+parent: Token filters
+nav_order: 340
+---
+
+# Porter stem token filter
+
+The `porter_stem` token filter reduces words to their base (or _stem_) form and removes common suffixes from words, which helps in matching similar words by their root. For example, the word `running` is stemmed to `run`. This token filter is primarily used for the English language and provides stemming based on the [Porter stemming algorithm](https://snowballstem.org/algorithms/porter/stemmer.html).
+
+
+## Example
+
+The following example request creates a new index named `my_stem_index` and configures an analyzer with a `porter_stem` filter:
+
+```json
+PUT /my_stem_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_porter_stem": {
+ "type": "porter_stem"
+ }
+ },
+ "analyzer": {
+ "porter_analyzer": {
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_porter_stem"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /my_stem_index/_analyze
+{
+ "text": "running runners ran",
+ "analyzer": "porter_analyzer"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "run",
+ "start_offset": 0,
+ "end_offset": 7,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "runner",
+ "start_offset": 8,
+ "end_offset": 15,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "ran",
+ "start_offset": 16,
+ "end_offset": 19,
+ "type": "",
+ "position": 2
+ }
+ ]
+}
+```
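+
+If certain terms should not be stemmed, you can place a `keyword_marker` token filter before `porter_stem`; tokens marked as keywords are left unchanged by the stemmer. The following is a minimal sketch (the index name, filter names, and keyword list are illustrative):
+
+```json
+PUT /my_stem_index_with_keywords
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "protect_terms": {
+          "type": "keyword_marker",
+          "keywords": ["running"]
+        },
+        "my_porter_stem": {
+          "type": "porter_stem"
+        }
+      },
+      "analyzer": {
+        "porter_analyzer_with_keywords": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "protect_terms",
+            "my_porter_stem"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+With this configuration, analyzing `running runners ran` should leave `running` unchanged while still stemming `runners` to `runner`.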
diff --git a/_analyzers/token-filters/predicate-token-filter.md b/_analyzers/token-filters/predicate-token-filter.md
new file mode 100644
index 0000000000..24729f0224
--- /dev/null
+++ b/_analyzers/token-filters/predicate-token-filter.md
@@ -0,0 +1,82 @@
+---
+layout: default
+title: Predicate token filter
+parent: Token filters
+nav_order: 340
+---
+
+# Predicate token filter
+
+The `predicate_token_filter` evaluates whether tokens should be kept or discarded, depending on the conditions defined in a custom script. The tokens are evaluated in the analysis predicate context. This filter supports only inline Painless scripts.
+
+## Parameters
+
+The `predicate_token_filter` has one required parameter: `script`. This parameter provides a condition that is used to evaluate whether the token should be kept.
+
+## Example
+
+The following example request creates a new index named `predicate_index` and configures an analyzer with a `predicate_token_filter`. The filter outputs only tokens that are longer than 7 characters:
+
+```json
+PUT /predicate_index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_predicate_filter": {
+ "type": "predicate_token_filter",
+ "script": {
+ "source": "token.term.length() > 7"
+ }
+ }
+ },
+ "analyzer": {
+ "predicate_analyzer": {
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_predicate_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /predicate_index/_analyze
+{
+ "text": "The OpenSearch community is growing rapidly",
+ "analyzer": "predicate_analyzer"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "opensearch",
+ "start_offset": 4,
+ "end_offset": 14,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "community",
+ "start_offset": 15,
+ "end_offset": 24,
+ "type": "",
+ "position": 2
+ }
+ ]
+}
+```
diff --git a/_analyzers/token-filters/remove-duplicates.md b/_analyzers/token-filters/remove-duplicates.md
new file mode 100644
index 0000000000..b0a589884a
--- /dev/null
+++ b/_analyzers/token-filters/remove-duplicates.md
@@ -0,0 +1,152 @@
+---
+layout: default
+title: Remove duplicates
+parent: Token filters
+nav_order: 350
+---
+
+# Remove duplicates token filter
+
+The `remove_duplicates` token filter is used to remove duplicate tokens that are generated in the same position during analysis.
+
+## Example
+
+The following example request creates an index with a `keyword_repeat` token filter, which adds a keyword version of each token in the same position as the token itself, followed by a `kstem` filter that creates a stemmed version of each token:
+
+```json
+PUT /example-index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "custom_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "keyword_repeat",
+ "kstem"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+Use the following request to analyze the string `Slower turtle`:
+
+```json
+GET /example-index/_analyze
+{
+ "analyzer": "custom_analyzer",
+ "text": "Slower turtle"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the token `turtle` twice in the same position:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "slower",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "slow",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "turtle",
+ "start_offset": 7,
+ "end_offset": 13,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "turtle",
+ "start_offset": 7,
+ "end_offset": 13,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
+
+The duplicate token can be removed by adding a `remove_duplicates` token filter to the index settings:
+
+```json
+PUT /index-remove-duplicate
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "custom_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "keyword_repeat",
+ "kstem",
+ "remove_duplicates"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /index-remove-duplicate/_analyze
+{
+ "analyzer": "custom_analyzer",
+ "text": "Slower turtle"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "slower",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "slow",
+ "start_offset": 0,
+ "end_offset": 6,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "turtle",
+ "start_offset": 7,
+ "end_offset": 13,
+ "type": "",
+ "position": 1
+ }
+ ]
+}
+```
\ No newline at end of file
diff --git a/_analyzers/token-filters/synonym-graph.md b/_analyzers/token-filters/synonym-graph.md
new file mode 100644
index 0000000000..75c7c79151
--- /dev/null
+++ b/_analyzers/token-filters/synonym-graph.md
@@ -0,0 +1,180 @@
+---
+layout: default
+title: Synonym graph
+parent: Token filters
+nav_order: 420
+---
+
+# Synonym graph token filter
+
+The `synonym_graph` token filter is a more advanced version of the `synonym` token filter. It supports multiword synonyms and processes synonyms across multiple tokens, making it ideal for phrases or scenarios in which relationships between tokens are important.
+
+## Parameters
+
+The `synonym_graph` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`synonyms` | Either `synonyms` or `synonyms_path` must be specified | String | A list of synonym rules defined directly in the configuration.
+`synonyms_path` | Either `synonyms` or `synonyms_path` must be specified | String | The file path to a file containing synonym rules (either an absolute path or a path relative to the config directory).
+`lenient` | Optional | Boolean | Whether to ignore exceptions when loading the rule configurations. Default is `false`.
+`format` | Optional | String | Specifies the format used to determine how OpenSearch defines and interprets synonyms. Valid values are:<br>- `solr`<br>- [`wordnet`](https://wordnet.princeton.edu/)<br>Default is `solr`.
+`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `false`.<br>For example:<br>If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows:<br>- `quick => quick`<br>- `quick => fast`<br>- `fast => quick`<br>- `fast => fast`<br>If `expand` is set to `false`, the synonym rules are configured as follows:<br>- `quick => quick`<br>- `fast => quick`
+
+## Example: Solr format
+
+The following example request creates a new index named `my-index` and configures an analyzer with a `synonym_graph` filter. The filter is configured with the default `solr` rule format:
+
+```json
+PUT /my-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_synonym_graph_filter": {
+ "type": "synonym_graph",
+ "synonyms": [
+ "sports car, race car",
+ "fast car, speedy vehicle",
+ "luxury car, premium vehicle",
+ "electric car, EV"
+ ]
+ }
+ },
+ "analyzer": {
+ "my_synonym_graph_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_synonym_graph_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-index/_analyze
+{
+ "analyzer": "my_synonym_graph_analyzer",
+ "text": "I just bought a sports car and it is a fast car."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "i","start_offset": 0,"end_offset": 1,"type": "","position": 0},
+ {"token": "just","start_offset": 2,"end_offset": 6,"type": "","position": 1},
+ {"token": "bought","start_offset": 7,"end_offset": 13,"type": "","position": 2},
+ {"token": "a","start_offset": 14,"end_offset": 15,"type": "","position": 3},
+ {"token": "race","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4},
+ {"token": "sports","start_offset": 16,"end_offset": 22,"type": "","position": 4,"positionLength": 2},
+ {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 5,"positionLength": 2},
+ {"token": "car","start_offset": 23,"end_offset": 26,"type": "","position": 6},
+ {"token": "and","start_offset": 27,"end_offset": 30,"type": "","position": 7},
+ {"token": "it","start_offset": 31,"end_offset": 33,"type": "","position": 8},
+ {"token": "is","start_offset": 34,"end_offset": 36,"type": "","position": 9},
+ {"token": "a","start_offset": 37,"end_offset": 38,"type": "","position": 10},
+ {"token": "speedy","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 11},
+ {"token": "fast","start_offset": 39,"end_offset": 43,"type": "","position": 11,"positionLength": 2},
+ {"token": "vehicle","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 12,"positionLength": 2},
+ {"token": "car","start_offset": 44,"end_offset": 47,"type": "","position": 13}
+ ]
+}
+```
+
+## Example: WordNet format
+
+The following example request creates a new index named `my-wordnet-index` and configures an analyzer with a `synonym_graph` filter. The filter is configured with the [`wordnet`](https://wordnet.princeton.edu/) rule format:
+
+```json
+PUT /my-wordnet-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_synonym_graph_filter": {
+ "type": "synonym_graph",
+ "format": "wordnet",
+ "synonyms": [
+ "s(100000001, 1, 'sports car', n, 1, 0).",
+ "s(100000001, 2, 'race car', n, 1, 0).",
+ "s(100000001, 3, 'fast car', n, 1, 0).",
+ "s(100000001, 4, 'speedy vehicle', n, 1, 0)."
+ ]
+ }
+ },
+ "analyzer": {
+ "my_synonym_graph_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_synonym_graph_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-wordnet-index/_analyze
+{
+ "analyzer": "my_synonym_graph_analyzer",
+ "text": "I just bought a sports car and it is a fast car."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {"token": "i","start_offset": 0,"end_offset": 1,"type": "","position": 0},
+ {"token": "just","start_offset": 2,"end_offset": 6,"type": "","position": 1},
+ {"token": "bought","start_offset": 7,"end_offset": 13,"type": "","position": 2},
+ {"token": "a","start_offset": 14,"end_offset": 15,"type": "","position": 3},
+ {"token": "race","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4},
+ {"token": "fast","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4,"positionLength": 2},
+ {"token": "speedy","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 4,"positionLength": 3},
+ {"token": "sports","start_offset": 16,"end_offset": 22,"type": "","position": 4,"positionLength": 4},
+ {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 5,"positionLength": 4},
+ {"token": "car","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 6,"positionLength": 3},
+ {"token": "vehicle","start_offset": 16,"end_offset": 26,"type": "SYNONYM","position": 7,"positionLength": 2},
+ {"token": "car","start_offset": 23,"end_offset": 26,"type": "","position": 8},
+ {"token": "and","start_offset": 27,"end_offset": 30,"type": "","position": 9},
+ {"token": "it","start_offset": 31,"end_offset": 33,"type": "","position": 10},
+ {"token": "is","start_offset": 34,"end_offset": 36,"type": "","position": 11},
+ {"token": "a","start_offset": 37,"end_offset": 38,"type": "","position": 12},
+ {"token": "sports","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13},
+ {"token": "race","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13,"positionLength": 2},
+ {"token": "speedy","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 13,"positionLength": 3},
+ {"token": "fast","start_offset": 39,"end_offset": 43,"type": "","position": 13,"positionLength": 4},
+ {"token": "car","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 14,"positionLength": 4},
+ {"token": "car","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 15,"positionLength": 3},
+ {"token": "vehicle","start_offset": 39,"end_offset": 47,"type": "SYNONYM","position": 16,"positionLength": 2},
+ {"token": "car","start_offset": 44,"end_offset": 47,"type": "","position": 17}
+ ]
+}
+```
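+
+To see how the `expand` parameter affects the generated rules, you can define equivalent terms and set `expand` explicitly. The following is a minimal sketch in which `quick` and `fast` are treated as interchangeable (the index and filter names are illustrative):
+
+```json
+PUT /my-expand-index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_expand_synonym_graph_filter": {
+          "type": "synonym_graph",
+          "expand": true,
+          "synonyms": [
+            "quick, fast"
+          ]
+        }
+      },
+      "analyzer": {
+        "my_expand_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "my_expand_synonym_graph_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Because `expand` is set to `true`, analyzing `quick` should produce both `quick` and `fast` at the same position. With `expand` set to `false`, every term in the rule is instead mapped to the first term only.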
diff --git a/_analyzers/token-filters/synonym.md b/_analyzers/token-filters/synonym.md
new file mode 100644
index 0000000000..a6865b14d7
--- /dev/null
+++ b/_analyzers/token-filters/synonym.md
@@ -0,0 +1,277 @@
+---
+layout: default
+title: Synonym
+parent: Token filters
+nav_order: 420
+---
+
+# Synonym token filter
+
+The `synonym` token filter allows you to map multiple terms to a single term or create equivalence groups between words, improving search flexibility.
+
+## Parameters
+
+The `synonym` token filter can be configured with the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`synonyms` | Either `synonyms` or `synonyms_path` must be specified | String | A list of synonym rules defined directly in the configuration.
+`synonyms_path` | Either `synonyms` or `synonyms_path` must be specified | String | The file path to a file containing synonym rules (either an absolute path or a path relative to the config directory).
+`lenient` | Optional | Boolean | Whether to ignore exceptions when loading the rule configurations. Default is `false`.
+`format` | Optional | String | Specifies the format used to determine how OpenSearch defines and interprets synonyms. Valid values are:<br>- `solr`<br>- [`wordnet`](https://wordnet.princeton.edu/)<br>Default is `solr`.
+`expand` | Optional | Boolean | Whether to expand equivalent synonym rules. Default is `false`.<br>For example:<br>If `synonyms` are defined as `"quick, fast"` and `expand` is set to `true`, then the synonym rules are configured as follows:<br>- `quick => quick`<br>- `quick => fast`<br>- `fast => quick`<br>- `fast => fast`<br>If `expand` is set to `false`, the synonym rules are configured as follows:<br>- `quick => quick`<br>- `fast => quick`
+
+## Example: Solr format
+
+The following example request creates a new index named `my-synonym-index` and configures an analyzer with a `synonym` filter. The filter is configured with the default `solr` rule format:
+
+```json
+PUT /my-synonym-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_synonym_filter": {
+ "type": "synonym",
+ "synonyms": [
+ "car, automobile",
+ "quick, fast, speedy",
+ "laptop => computer"
+ ]
+ }
+ },
+ "analyzer": {
+ "my_synonym_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_synonym_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-synonym-index/_analyze
+{
+ "analyzer": "my_synonym_analyzer",
+ "text": "The quick dog jumps into the car with a laptop"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "the",
+ "start_offset": 0,
+ "end_offset": 3,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "quick",
+ "start_offset": 4,
+ "end_offset": 9,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "fast",
+ "start_offset": 4,
+ "end_offset": 9,
+ "type": "SYNONYM",
+ "position": 1
+ },
+ {
+ "token": "speedy",
+ "start_offset": 4,
+ "end_offset": 9,
+ "type": "SYNONYM",
+ "position": 1
+ },
+ {
+ "token": "dog",
+ "start_offset": 10,
+ "end_offset": 13,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "jumps",
+ "start_offset": 14,
+ "end_offset": 19,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "into",
+ "start_offset": 20,
+ "end_offset": 24,
+ "type": "",
+ "position": 4
+ },
+ {
+ "token": "the",
+ "start_offset": 25,
+ "end_offset": 28,
+ "type": "",
+ "position": 5
+ },
+ {
+ "token": "car",
+ "start_offset": 29,
+ "end_offset": 32,
+ "type": "",
+ "position": 6
+ },
+ {
+ "token": "automobile",
+ "start_offset": 29,
+ "end_offset": 32,
+ "type": "SYNONYM",
+ "position": 6
+ },
+ {
+ "token": "with",
+ "start_offset": 33,
+ "end_offset": 37,
+ "type": "",
+ "position": 7
+ },
+ {
+ "token": "a",
+ "start_offset": 38,
+ "end_offset": 39,
+ "type": "",
+ "position": 8
+ },
+ {
+ "token": "computer",
+ "start_offset": 40,
+ "end_offset": 46,
+ "type": "SYNONYM",
+ "position": 9
+ }
+ ]
+}
+```
+
+## Example: WordNet format
+
+The following example request creates a new index named `my-wordnet-index` and configures an analyzer with a `synonym` filter. The filter is configured with the [`wordnet`](https://wordnet.princeton.edu/) rule format:
+
+```json
+PUT /my-wordnet-index
+{
+ "settings": {
+ "analysis": {
+ "filter": {
+ "my_wordnet_synonym_filter": {
+ "type": "synonym",
+ "format": "wordnet",
+ "synonyms": [
+ "s(100000001,1,'fast',v,1,0).",
+ "s(100000001,2,'quick',v,1,0).",
+ "s(100000001,3,'swift',v,1,0)."
+ ]
+ }
+ },
+ "analyzer": {
+ "my_wordnet_analyzer": {
+ "type": "custom",
+ "tokenizer": "standard",
+ "filter": [
+ "lowercase",
+ "my_wordnet_synonym_filter"
+ ]
+ }
+ }
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+GET /my-wordnet-index/_analyze
+{
+ "analyzer": "my_wordnet_analyzer",
+ "text": "I have a fast car"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+ "tokens": [
+ {
+ "token": "i",
+ "start_offset": 0,
+ "end_offset": 1,
+ "type": "",
+ "position": 0
+ },
+ {
+ "token": "have",
+ "start_offset": 2,
+ "end_offset": 6,
+ "type": "",
+ "position": 1
+ },
+ {
+ "token": "a",
+ "start_offset": 7,
+ "end_offset": 8,
+ "type": "",
+ "position": 2
+ },
+ {
+ "token": "fast",
+ "start_offset": 9,
+ "end_offset": 13,
+ "type": "",
+ "position": 3
+ },
+ {
+ "token": "quick",
+ "start_offset": 9,
+ "end_offset": 13,
+ "type": "SYNONYM",
+ "position": 3
+ },
+ {
+ "token": "swift",
+ "start_offset": 9,
+ "end_offset": 13,
+ "type": "SYNONYM",
+ "position": 3
+ },
+ {
+ "token": "car",
+ "start_offset": 14,
+ "end_offset": 17,
+ "type": "",
+ "position": 4
+ }
+ ]
+}
+```
diff --git a/_search-plugins/search-pipelines/ml-inference-search-response.md b/_search-plugins/search-pipelines/ml-inference-search-response.md
index e8f17a667c..b0573d17be 100644
--- a/_search-plugins/search-pipelines/ml-inference-search-response.md
+++ b/_search-plugins/search-pipelines/ml-inference-search-response.md
@@ -751,4 +751,8 @@ The response includes the original documents and their reranked scores:
"shards": []
}
}
-```
\ No newline at end of file
+```
+
+## Next steps
+
+- See a comprehensive example of [reranking by a field using an externally hosted cross-encoder model]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/rerank-by-field-cross-encoder/).
\ No newline at end of file
diff --git a/_search-plugins/search-pipelines/rerank-processor.md b/_search-plugins/search-pipelines/rerank-processor.md
index 84819b17c8..11691eff95 100644
--- a/_search-plugins/search-pipelines/rerank-processor.md
+++ b/_search-plugins/search-pipelines/rerank-processor.md
@@ -191,4 +191,5 @@ POST /book-index/_search?search_pipeline=rerank_byfield_pipeline
- Learn more about [reranking search results]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/reranking-search-results/).
- See a complete example of [reranking using a cross-encoder model]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/rerank-cross-encoder/).
-- See a complete example of [reranking by a document field]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/rerank-by-field/).
\ No newline at end of file
+- See a complete example of [reranking by a document field]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/rerank-by-field/).
+- See a comprehensive example of [reranking by a field using an externally hosted cross-encoder model]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/rerank-by-field-cross-encoder/).
\ No newline at end of file
diff --git a/_search-plugins/search-relevance/rerank-by-field-cross-encoder.md b/_search-plugins/search-relevance/rerank-by-field-cross-encoder.md
new file mode 100644
index 0000000000..7f30689491
--- /dev/null
+++ b/_search-plugins/search-relevance/rerank-by-field-cross-encoder.md
@@ -0,0 +1,276 @@
+---
+layout: default
+title: Reranking by a field using a cross-encoder
+parent: Reranking search results
+grand_parent: Search relevance
+has_children: false
+nav_order: 30
+---
+
+# Reranking by a field using an externally hosted cross-encoder model
+Introduced 2.18
+{: .label .label-purple }
+
+In this tutorial, you'll learn how to use a cross-encoder model hosted on Amazon SageMaker to rerank search results and improve search relevance.
+
+To rerank documents, you'll configure a search pipeline that processes search results at query time. The pipeline intercepts search results and passes them to the [`ml_inference` search response processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/ml-inference-search-response/), which invokes the cross-encoder model. The model generates scores used to rerank the matching documents [`by_field`]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/rerank-by-field/).
+
+## Prerequisite: Deploy a model on Amazon SageMaker
+
+Run the following code to deploy a model on Amazon SageMaker. For this example, you'll use the [`ms-marco-MiniLM-L-6-v2`](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) Hugging Face cross-encoder model. We recommend using a GPU instance for better performance:
+
+```python
+import sagemaker
+import boto3
+from sagemaker.huggingface import HuggingFaceModel
+
+sess = sagemaker.Session()
+role = sagemaker.get_execution_role()
+
+hub = {
+ 'HF_MODEL_ID':'cross-encoder/ms-marco-MiniLM-L-6-v2',
+ 'HF_TASK':'text-classification'
+}
+huggingface_model = HuggingFaceModel(
+ transformers_version='4.37.0',
+ pytorch_version='2.1.0',
+ py_version='py310',
+ env=hub,
+ role=role,
+)
+predictor = huggingface_model.deploy(
+ initial_instance_count=1, # number of instances
+ instance_type='ml.m5.xlarge' # ec2 instance type
+)
+```
+{% include copy.html %}
+
+After deploying the model, you can find the model endpoint by going to the Amazon SageMaker console in the AWS Management Console and selecting **Inference > Endpoints** in the left navigation pane. Note the URL of the created endpoint; you'll use it to create a connector.
+
+## Running a search with reranking
+
+To run a search with reranking, follow these steps:
+
+1. [Create a connector](#step-1-create-a-connector).
+1. [Register the model](#step-2-register-the-model).
+1. [Ingest documents into an index](#step-3-ingest-documents-into-an-index).
+1. [Create a search pipeline](#step-4-create-a-search-pipeline).
+1. [Search using reranking](#step-5-search-using-reranking).
+
+## Step 1: Create a connector
+
+Create a connector to the cross-encoder model by providing the model URL in the `actions.url` parameter:
+
+```json
+POST /_plugins/_ml/connectors/_create
+{
+ "name": "SageMaker cross-encoder model",
+ "description": "Test connector for SageMaker cross-encoder hosted model",
+ "version": 1,
+ "protocol": "aws_sigv4",
+ "credential": {
+ "access_key": "",
+ "secret_key": "",
+ "session_token": ""
+ },
+ "parameters": {
+ "region": "",
+ "service_name": "sagemaker"
+ },
+ "actions": [
+ {
+ "action_type": "predict",
+ "method": "POST",
+ "url": "",
+ "headers": {
+ "content-type": "application/json"
+ },
+ "request_body": "{ \"inputs\": { \"text\": \"${parameters.text}\", \"text_pair\": \"${parameters.text_pair}\" }}"
+ }
+ ]
+}
+```
+{% include copy-curl.html %}
+
+Note the connector ID contained in the response; you'll use it in the following step.
+
+## Step 2: Register the model
+
+To register the model, provide the connector ID in the `connector_id` parameter:
+
+```json
+POST /_plugins/_ml/models/_register
+{
+ "name": "Cross encoder model",
+ "version": "1.0.1",
+ "function_name": "remote",
+ "description": "Using a SageMaker endpoint to apply a cross encoder model",
+ "connector_id": ""
+}
+```
+{% include copy-curl.html %}
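+
+Optionally, before using the model in a search pipeline, you can verify that the connector and model work by calling the ML Commons Predict API. The following is a sketch in which `your_model_id` is a placeholder for the model ID returned by the registration step; the `text` and `text_pair` parameters correspond to the variables defined in the connector's `request_body`:
+
+```json
+POST /_plugins/_ml/models/your_model_id/_predict
+{
+  "parameters": {
+    "text": "Astoria is home to many artists and has a large Greek-American community.",
+    "text_pair": "artists art creative community"
+  }
+}
+```
+
+If the setup is correct, the response should include the relevance score produced by the cross-encoder model.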
+
+
+## Step 3: Ingest documents into an index
+
+Create an index and ingest sample documents containing facts about the New York City boroughs:
+
+```json
+POST /nyc_areas/_bulk
+{ "index": { "_id": 1 } }
+{ "borough": "Queens", "area_name": "Astoria", "description": "Astoria is a neighborhood in the western part of Queens, New York City, known for its diverse community and vibrant cultural scene.", "population": 93000, "facts": "Astoria is home to many artists and has a large Greek-American community. The area also boasts some of the best Mediterranean food in NYC." }
+{ "index": { "_id": 2 } }
+{ "borough": "Queens", "area_name": "Flushing", "description": "Flushing is a neighborhood in the northern part of Queens, famous for its Asian-American population and bustling business district.", "population": 227000, "facts": "Flushing is one of the most ethnically diverse neighborhoods in NYC, with a large Chinese and Korean population. It is also home to the USTA Billie Jean King National Tennis Center." }
+{ "index": { "_id": 3 } }
+{ "borough": "Brooklyn", "area_name": "Williamsburg", "description": "Williamsburg is a trendy neighborhood in Brooklyn known for its hipster culture, vibrant art scene, and excellent restaurants.", "population": 150000, "facts": "Williamsburg is a hotspot for young professionals and artists. The neighborhood has seen rapid gentrification over the past two decades." }
+{ "index": { "_id": 4 } }
+{ "borough": "Manhattan", "area_name": "Harlem", "description": "Harlem is a historic neighborhood in Upper Manhattan, known for its significant African-American cultural heritage.", "population": 116000, "facts": "Harlem was the birthplace of the Harlem Renaissance, a cultural movement that celebrated Black culture through art, music, and literature." }
+{ "index": { "_id": 5 } }
+{ "borough": "The Bronx", "area_name": "Riverdale", "description": "Riverdale is a suburban-like neighborhood in the Bronx, known for its leafy streets and affluent residential areas.", "population": 48000, "facts": "Riverdale is one of the most affluent areas in the Bronx, with beautiful parks, historic homes, and excellent schools." }
+{ "index": { "_id": 6 } }
+{ "borough": "Staten Island", "area_name": "St. George", "description": "St. George is the main commercial and cultural center of Staten Island, offering stunning views of Lower Manhattan.", "population": 15000, "facts": "St. George is home to the Staten Island Ferry terminal and is a gateway to Staten Island, offering stunning views of the Statue of Liberty and Ellis Island." }
+```
+{% include copy-curl.html %}
+
+## Step 4: Create a search pipeline
+
+Next, create a search pipeline for reranking. In the search pipeline configuration, the `input_map` and `output_map` define how the input data is prepared for the cross-encoder model and how the model's output is interpreted for reranking:
+
+- The `input_map` specifies which fields in the search documents and the query should be used as model inputs:
+ - The `text` field maps to the `facts` field in the indexed documents. It provides the document-specific content that the model will analyze.
+ - The `text_pair` field dynamically retrieves the search query text (`multi_match.query`) from the search request.
+
+ The combination of `text` (document `facts`) and `text_pair` (search `query`) allows the cross-encoder model to compare the relevance of the document to the query, considering their semantic relationship.
+
+- The `output_map` field specifies how the output of the model is mapped to the fields in the response:
+ - The `rank_score` field in the response will store the model's relevance score, which will be used to perform reranking.
+
+When using the `by_field` rerank type, the `rank_score` field will contain the same score as the `_score` field. To remove the `rank_score` field from the search results, set `remove_target_field` to `true`. The original BM25 score, before reranking, is included for debugging purposes by setting `keep_previous_score` to `true`. This allows you to compare the original score with the reranked score to evaluate improvements in search relevance.
+
+To create the search pipeline, send the following request:
+
+```json
+PUT /_search/pipeline/my_pipeline
+{
+ "response_processors": [
+ {
+ "ml_inference": {
+ "tag": "ml_inference",
+ "description": "This processor runs ml inference during search response",
+ "model_id": "",
+ "function_name": "REMOTE",
+ "input_map": [
+ {
+ "text": "facts",
+ "text_pair":"$._request.query.multi_match.query"
+ }
+ ],
+ "output_map": [
+ {
+ "rank_score": "$.score"
+ }
+ ],
+ "full_response_path": false,
+ "model_config": {},
+ "ignore_missing": false,
+ "ignore_failure": false,
+ "one_to_one": true
+ },
+
+ "rerank": {
+ "by_field": {
+ "target_field": "rank_score",
+ "remove_target_field": true,
+ "keep_previous_score" : true
+ }
+ }
+
+ }
+ ]
+}
+```
+{% include copy-curl.html %}
+
+## Step 5: Search using reranking
+
+Use the following request to search indexed documents and rerank them using the cross-encoder model. The request retrieves documents containing any of the specified terms in the `description` or `facts` fields. These terms are then used to compare and rerank the matched documents:
+
+```json
+POST /nyc_areas/_search?search_pipeline=my_pipeline
+{
+ "query": {
+ "multi_match": {
+ "query": "artists art creative community",
+ "fields": ["description", "facts"]
+ }
+ }
+}
+```
+{% include copy-curl.html %}
+
+In the response, the `previous_score` field contains the document's original BM25 score, which it would have received if you hadn't applied the pipeline. Note that while BM25 ranked "Astoria" the highest, the cross-encoder model prioritized "Harlem" because the model judged its content to be more semantically relevant to the query:
+
+```json
+{
+ "took": 4,
+ "timed_out": false,
+ "_shards": {
+ "total": 1,
+ "successful": 1,
+ "skipped": 0,
+ "failed": 0
+ },
+ "hits": {
+ "total": {
+ "value": 3,
+ "relation": "eq"
+ },
+ "max_score": 0.03418137,
+ "hits": [
+ {
+ "_index": "nyc_areas",
+ "_id": "4",
+ "_score": 0.03418137,
+ "_source": {
+ "area_name": "Harlem",
+ "description": "Harlem is a historic neighborhood in Upper Manhattan, known for its significant African-American cultural heritage.",
+ "previous_score": 1.6489418,
+ "borough": "Manhattan",
+ "facts": "Harlem was the birthplace of the Harlem Renaissance, a cultural movement that celebrated Black culture through art, music, and literature.",
+ "population": 116000
+ }
+ },
+ {
+ "_index": "nyc_areas",
+ "_id": "1",
+ "_score": 0.0090838,
+ "_source": {
+ "area_name": "Astoria",
+ "description": "Astoria is a neighborhood in the western part of Queens, New York City, known for its diverse community and vibrant cultural scene.",
+ "previous_score": 2.519608,
+ "borough": "Queens",
+ "facts": "Astoria is home to many artists and has a large Greek-American community. The area also boasts some of the best Mediterranean food in NYC.",
+ "population": 93000
+ }
+ },
+ {
+ "_index": "nyc_areas",
+ "_id": "3",
+ "_score": 0.0032599436,
+ "_source": {
+ "area_name": "Williamsburg",
+ "description": "Williamsburg is a trendy neighborhood in Brooklyn known for its hipster culture, vibrant art scene, and excellent restaurants.",
+ "previous_score": 1.5632852,
+ "borough": "Brooklyn",
+ "facts": "Williamsburg is a hotspot for young professionals and artists. The neighborhood has seen rapid gentrification over the past two decades.",
+ "population": 150000
+ }
+ }
+ ]
+ },
+ "profile": {
+ "shards": []
+ }
+}
+```
+
\ No newline at end of file
diff --git a/_search-plugins/search-relevance/rerank-by-field.md b/_search-plugins/search-relevance/rerank-by-field.md
index 9c7e7419e5..e6f65a4d25 100644
--- a/_search-plugins/search-relevance/rerank-by-field.md
+++ b/_search-plugins/search-relevance/rerank-by-field.md
@@ -116,7 +116,7 @@ POST /book-index/_search
```
{% include copy-curl.html %}
-The response contains documents sorted in descending order based on the `reviews.starts` field. Each document contains the original query score in the `previous_score` field:
+The response contains documents sorted in descending order based on the `reviews.stars` field. Each document contains the original query score in the `previous_score` field:
```json
{
@@ -205,4 +205,5 @@ The response contains documents sorted in descending order based on the `reviews
## Next steps
-- Learn more about the [`rerank` processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/rerank-processor/).
\ No newline at end of file
+- Learn more about the [`rerank` processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/rerank-processor/).
+- See a comprehensive example of [reranking by a field using an externally hosted cross-encoder model]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/rerank-by-field-cross-encoder/).
\ No newline at end of file
diff --git a/_search-plugins/search-relevance/rerank-cross-encoder.md b/_search-plugins/search-relevance/rerank-cross-encoder.md
index 854908e69c..64f93c886c 100644
--- a/_search-plugins/search-relevance/rerank-cross-encoder.md
+++ b/_search-plugins/search-relevance/rerank-cross-encoder.md
@@ -118,4 +118,5 @@ Alternatively, you can provide the full path to the field containing the context
## Next steps
-- Learn more about the [`rerank` processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/rerank-processor/).
\ No newline at end of file
+- Learn more about the [`rerank` processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/rerank-processor/).
+- See a comprehensive example of [reranking by a field using an externally hosted cross-encoder model]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/rerank-by-field-cross-encoder/).
\ No newline at end of file