Skip to content

Commit

Permalink
add stemmer token filter docs #8277 (#8444)
Browse files Browse the repository at this point in the history
* add stemmer token filter docs #8277

Signed-off-by: Anton Rubin <[email protected]>

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
  • Loading branch information
4 people authored Dec 2, 2024
1 parent 6ff6481 commit 74ade40
Show file tree
Hide file tree
Showing 2 changed files with 119 additions and 1 deletion.
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ Token filter | Underlying Lucene token filter| Description
[`reverse`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/reverse/) | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
[`shingle`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/shingle/) | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but are generated using words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
[`snowball`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/snowball/) | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). The `snowball` token filter supports using the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
[`stemmer`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer/) | N/A | Provides algorithmic stemming for the following languages used in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
[`synonym`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym/) | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
Expand Down
118 changes: 118 additions & 0 deletions _analyzers/token-filters/stemmer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
---
layout: default
title: Stemmer
parent: Token filters
nav_order: 390
---

# Stemmer token filter

The `stemmer` token filter reduces words to their root or base form (also known as their _stem_).

## Parameters

The `stemmer` token filter can be configured with a `language` parameter that accepts the following values:

- Arabic: `arabic`
- Armenian: `armenian`
- Basque: `basque`
- Bengali: `bengali`
- Brazilian Portuguese: `brazilian`
- Bulgarian: `bulgarian`
- Catalan: `catalan`
- Czech: `czech`
- Danish: `danish`
- Dutch: `dutch, dutch_kp`
- English: `english` (default), `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`
- Estonian: `estonian`
- Finnish: `finnish`, `light_finnish`
- French: `light_french`, `french`, `minimal_french`
- Galician: `galician`, `minimal_galician` (plural step only)
- German: `light_german`, `german`, `german2`, `minimal_german`
- Greek: `greek`
- Hindi: `hindi`
- Hungarian: `hungarian, light_hungarian`
- Indonesian: `indonesian`
- Irish: `irish`
- Italian: `light_italian, italian`
- Kurdish (Sorani): `sorani`
- Latvian: `latvian`
- Lithuanian: `lithuanian`
- Norwegian (Bokmål): `norwegian`, `light_norwegian`, `minimal_norwegian`
- Norwegian (Nynorsk): `light_nynorsk`, `minimal_nynorsk`
- Portuguese: `light_portuguese`, `minimal_portuguese`, `portuguese`, `portuguese_rslp`
- Romanian: `romanian`
- Russian: `russian`, `light_russian`
- Spanish: `light_spanish`, `spanish`
- Swedish: `swedish`, `light_swedish`
- Turkish: `turkish`

You can also use the `name` parameter as an alias for the `language` parameter. If both are set, the `name` parameter is ignored.
{: .note}

## Example

The following example request creates a new index named `my-stemmer-index` and configures an analyzer with a `stemmer` filter:

```json
PUT /my-stemmer-index
{
"settings": {
"analysis": {
"filter": {
"my_english_stemmer": {
"type": "stemmer",
"language": "english"
}
},
"analyzer": {
"my_stemmer_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_english_stemmer"
]
}
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /my-stemmer-index/_analyze
{
"analyzer": "my_stemmer_analyzer",
"text": "running runs"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "run",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "run",
"start_offset": 8,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
}
]
}
```

0 comments on commit 74ade40

Please sign in to comment.