add stemmer token filter docs #8277 #8444

Merged
Changes from 2 commits
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -52,7 +52,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache
`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
`snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
[`stemmer`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer/) | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
`synonym` | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
118 changes: 118 additions & 0 deletions _analyzers/token-filters/stemmer.md
@@ -0,0 +1,118 @@
---
layout: default
title: Stemmer
parent: Token filters
nav_order: 390
---

# Stemmer token filter

The `stemmer` token filter reduces words to their root or base form (also known as their _stem_).

## Parameters

The `stemmer` token filter can be configured with a `language` parameter that accepts the following values:

- Arabic: `arabic`
- Armenian: `armenian`
- Basque: `basque`
- Bengali: `bengali`
- Brazilian Portuguese: `brazilian`
- Bulgarian: `bulgarian`
- Catalan: `catalan`
- Czech: `czech`
- Danish: `danish`
- Dutch: `dutch`, `dutch_kp`
- English: `english` (default), `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`
- Estonian: `estonian`
- Finnish: `finnish`, `light_finnish`
- French: `light_french`, `french`, `minimal_french`
- Galician: `galician`, `minimal_galician` (plural step only)
- German: `light_german`, `german`, `german2`, `minimal_german`
- Greek: `greek`
- Hindi: `hindi`
- Hungarian: `hungarian`, `light_hungarian`
- Indonesian: `indonesian`
- Irish: `irish`
- Italian: `light_italian`, `italian`
- Kurdish (Sorani): `sorani`
- Latvian: `latvian`
- Lithuanian: `lithuanian`
- Norwegian (Bokmål): `norwegian`, `light_norwegian`, `minimal_norwegian`
- Norwegian (Nynorsk): `light_nynorsk`, `minimal_nynorsk`
- Portuguese: `light_portuguese`, `minimal_portuguese`, `portuguese`, `portuguese_rslp`
- Romanian: `romanian`
- Russian: `russian`, `light_russian`
- Spanish: `light_spanish`, `spanish`
- Swedish: `swedish`, `light_swedish`
- Turkish: `turkish`

You can also use the `name` parameter as an alias for the `language` parameter. If both are set, the `name` parameter is ignored.
{: .note}
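
As a minimal sketch of trying a non-default `language` value without creating an index, the following request passes an inline `stemmer` filter definition to the `_analyze` API. The `light_german` value comes from the preceding list. The sample text is purely illustrative, and no particular output is implied:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stemmer",
      "language": "light_german"
    }
  ],
  "text": "Häuser Hauses"
}
```
{% include copy-curl.html %}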

## Example

The following example request creates a new index named `my-stemmer-index` and configures an analyzer with a `stemmer` filter:

```json
PUT /my-stemmer-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "my_stemmer_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_english_stemmer"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /my-stemmer-index/_analyze
{
  "analyzer": "my_stemmer_analyzer",
  "text": "running runs"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "run",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "run",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```