Commit 729185a

Merge branch 'main' into expand-data-corpus

Naarcha-AWS authored Nov 25, 2024
2 parents 56a3c75 + 676dd77

Showing 18 changed files with 1,914 additions and 18 deletions.
26 changes: 13 additions & 13 deletions _analyzers/token-filters/index.md

Large diffs are not rendered by default.

138 changes: 138 additions & 0 deletions _analyzers/token-filters/min-hash.md
@@ -0,0 +1,138 @@
---
layout: default
title: Min hash
parent: Token filters
nav_order: 270
---

# Min hash token filter

The `min_hash` token filter generates hashes for a set of tokens (typically, the tokens of an analyzed field) using the [MinHash](https://en.wikipedia.org/wiki/MinHash) approximation algorithm, which is useful for detecting similarity between documents.
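
To see the intuition behind the approximation, consider the following standalone Python sketch. It is illustrative only (the seeded MD5 hash functions and sample token sets are assumptions, not the filter's actual implementation): the fraction of positions on which two MinHash signatures agree approximates the Jaccard similarity of the underlying token sets.

```python
import hashlib

def min_hash_signature(tokens, num_hashes=64):
    """Compute a MinHash signature: for each of num_hashes seeded hash
    functions, keep the minimum hash value over all tokens."""
    signature = []
    for seed in range(num_hashes):
        min_value = min(
            int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)
            for token in tokens
        )
        signature.append(min_value)
    return signature

set_1 = {"opensearch", "is", "a", "powerful", "search", "engine"}
set_2 = {"opensearch", "is", "a", "very", "powerful", "search", "engine"}

sig_1 = min_hash_signature(set_1)
sig_2 = min_hash_signature(set_2)

# The fraction of matching signature positions approximates the
# Jaccard similarity of the original sets (here, 6/7 ≈ 0.857).
matches = sum(a == b for a, b in zip(sig_1, sig_2))
print(matches / len(sig_1))
```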

## Parameters

The `min_hash` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`hash_count` | Optional | Integer | The number of hash values to generate for each token. Increasing this value generally improves the accuracy of similarity estimation but increases the computational cost. Default is `1`.
`bucket_count` | Optional | Integer | The number of hash buckets to use. This affects the granularity of the hashing. A larger number of buckets provides finer granularity and reduces hash collisions but requires more memory. Default is `512`.
`hash_set_size` | Optional | Integer | The number of hashes to retain in each bucket. This can influence the hashing quality. Larger set sizes may lead to better similarity detection but consume more memory. Default is `1`.
`with_rotation` | Optional | Boolean | When set to `true`, the filter populates empty buckets with the value from the first non-empty bucket found to its circular right, provided that the `hash_set_size` is `1`. If the `bucket_count` argument exceeds `1`, this setting automatically defaults to `true`; otherwise, it defaults to `false`.
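
To illustrate the `with_rotation` behavior described in the table, the following toy Python function (an illustrative sketch, not the actual filter implementation) fills each empty bucket with the value of the first non-empty bucket found to its circular right:

```python
def rotation_fill(buckets):
    """Fill empty buckets (None) with the value of the first
    non-empty bucket found to their circular right."""
    n = len(buckets)
    filled = list(buckets)
    for i in range(n):
        if filled[i] is None:
            for step in range(1, n):
                candidate = buckets[(i + step) % n]
                if candidate is not None:
                    filled[i] = candidate
                    break
    return filled

print(rotation_fill([None, 7, None, None, 3]))  # [7, 7, 3, 3, 3]
```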

## Example

The following example request creates a new index named `minhash_index` and configures an analyzer with a `min_hash` filter:

```json
PUT /minhash_index
{
  "settings": {
    "analysis": {
      "filter": {
        "minhash_filter": {
          "type": "min_hash",
          "hash_count": 3,
          "bucket_count": 512,
          "hash_set_size": 1,
          "with_rotation": false
        }
      },
      "analyzer": {
        "minhash_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "minhash_filter"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /minhash_index/_analyze
{
  "analyzer": "minhash_analyzer",
  "text": "OpenSearch is very powerful."
}
```
{% include copy-curl.html %}

The response contains the generated tokens (the tokens are not human readable because they represent hashes):

```json
{
  "tokens" : [
    {
      "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "MIN_HASH",
      "position" : 0
    },
    {
      "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "MIN_HASH",
      "position" : 0
    },
    ...
```

To demonstrate the usefulness of the `min_hash` token filter, the following Python script uses the previously created analyzer to compare two similar strings:

```python
from opensearchpy import OpenSearch

# Initialize the OpenSearch client with basic authentication
host = 'https://localhost:9200'  # Update if using a different host/port
auth = ('admin', 'admin')        # Username and password

# Create the OpenSearch client with SSL verification turned off
client = OpenSearch(
    hosts=[host],
    http_auth=auth,
    use_ssl=True,
    verify_certs=False,   # Disable SSL certificate validation
    ssl_show_warn=False   # Suppress SSL warnings in the output
)

# Analyze text and return the MinHash tokens
def analyze_text(index, text):
    response = client.indices.analyze(
        index=index,
        body={
            "analyzer": "minhash_analyzer",
            "text": text
        }
    )
    return [token['token'] for token in response['tokens']]

# Analyze two similar texts
tokens_1 = analyze_text('minhash_index', 'OpenSearch is a powerful search engine.')
tokens_2 = analyze_text('minhash_index', 'OpenSearch is a very powerful search engine.')

# Calculate the Jaccard similarity of the two token sets
set_1 = set(tokens_1)
set_2 = set(tokens_2)
shared_tokens = set_1.intersection(set_2)
jaccard_similarity = len(shared_tokens) / len(set_1.union(set_2))

print(f"Jaccard Similarity: {jaccard_similarity}")
```

The script outputs the Jaccard similarity score:

```yaml
Jaccard Similarity: 0.8571428571428571
```
165 changes: 165 additions & 0 deletions _analyzers/token-filters/multiplexer.md
@@ -0,0 +1,165 @@
---
layout: default
title: Multiplexer
parent: Token filters
nav_order: 280
---

# Multiplexer token filter

The `multiplexer` token filter allows you to create multiple versions of the same token by applying different filters. This is useful when you want to analyze the same token in multiple ways. For example, you may want to analyze a token using different stemming, synonyms, or n-gram filters and use all of the generated tokens together. This token filter works by duplicating the token stream and applying different filters to each copy.
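
Conceptually, for each input token, the filter emits the original token (when `preserve_original` is `true`) plus one output per configured filter, then removes duplicates. The following toy Python sketch (a conceptual model with stand-in filters, not the actual Lucene implementation) illustrates this behavior:

```python
def multiplex(tokens, filters, preserve_original=True):
    """Toy model of the multiplexer: run each token through every
    filter, optionally keep the original, and drop duplicates
    produced at the same position."""
    output = []
    for token in tokens:
        variants = []
        if preserve_original:
            variants.append(token)
        for f in filters:
            variants.extend(f(token))
        seen = set()
        for variant in variants:
            if variant not in seen:
                seen.add(variant)
                output.append(variant)
    return output

def stem(token):
    # Crude stand-in for a stemmer
    return [token[:-1]] if token.endswith("s") else [token]

def synonyms(token):
    # Stand-in for a synonym filter with the rule "quick, fast"
    return ["fast"] if token == "quick" else []

print(multiplex(["hides", "quick"], [stem, synonyms]))
# ['hides', 'hide', 'quick', 'fast']
```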

The `multiplexer` token filter removes duplicate tokens from the token stream.
{: .important}

The `multiplexer` token filter does not support multiword `synonym` or `synonym_graph` token filters or `shingle` token filters because they need to analyze not only the current token but also upcoming tokens in order to determine how to transform the input correctly.
{: .important}

## Parameters

The `multiplexer` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`filters` | Optional | List of strings | A list of token filters to apply to each copy of the token stream. Default is an empty list.
`preserve_original` | Optional | Boolean | Whether to keep the original token as one of the outputs. Default is `true`.

## Example

The following example request creates a new index named `multiplexer_index` and configures an analyzer with a `multiplexer` filter:

```json
PUT /multiplexer_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "quick,fast"
          ]
        },
        "multiplexer_filter": {
          "type": "multiplexer",
          "filters": ["english_stemmer", "synonym_filter"],
          "preserve_original": true
        }
      },
      "analyzer": {
        "multiplexer_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "multiplexer_filter"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /multiplexer_index/_analyze
{
  "analyzer": "multiplexer_analyzer",
  "text": "The slow turtle hides from the quick dog"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 4,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "turtle",
      "start_offset": 9,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "turtl",
      "start_offset": 9,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "hides",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "hide",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "from",
      "start_offset": 22,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "the",
      "start_offset": 27,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "quick",
      "start_offset": 31,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "fast",
      "start_offset": 31,
      "end_offset": 36,
      "type": "SYNONYM",
      "position": 6
    },
    {
      "token": "dog",
      "start_offset": 37,
      "end_offset": 40,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
```
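
Because the stemmed and synonym tokens are indexed at the same positions as the originals, queries can match variants that never appear in the source text. The following Python sketch illustrates this; the `content` field name and its mapping are assumptions for illustration, not part of the index created above:

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=['https://localhost:9200'],
    http_auth=('admin', 'admin'),
    use_ssl=True,
    verify_certs=False,
    ssl_show_warn=False
)

# Map a text field (named "content" here for illustration) to the
# multiplexer analyzer defined above
client.indices.put_mapping(
    index='multiplexer_index',
    body={
        "properties": {
            "content": {"type": "text", "analyzer": "multiplexer_analyzer"}
        }
    }
)

client.index(
    index='multiplexer_index',
    id=1,
    body={"content": "The slow turtle hides from the quick dog"},
    refresh=True
)

# "fast" does not occur in the text, but the synonym branch of the
# multiplexer indexed it alongside "quick", so the query matches
response = client.search(
    index='multiplexer_index',
    body={"query": {"match": {"content": "fast"}}}
)
print(response['hits']['hits'][0]['_source']['content'])
```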