v1.4: Separator and non-separator tokens #2553

Merged
merged 5 commits into from
Sep 14, 2023
22 changes: 22 additions & 0 deletions .code-samples.meilisearch.yaml
@@ -1120,6 +1120,28 @@ search_parameter_guide_attributes_to_search_on_1: |-
      "q": "adventure",
      "attributesToSearchOn": ["overview"]
    }'
get_separator_tokens_1: |-
  curl \
    -X GET 'http://localhost:7700/indexes/articles/settings/separator-tokens'
update_separator_tokens_1: |-
  curl \
    -X PUT 'http://localhost:7700/indexes/articles/settings/separator-tokens' \
    -H 'Content-Type: application/json' \
    --data-binary '["|", "…"]'
reset_separator_tokens_1: |-
  curl \
    -X DELETE 'http://localhost:7700/indexes/articles/settings/separator-tokens'
get_non_separator_tokens_1: |-
  curl \
    -X GET 'http://localhost:7700/indexes/articles/settings/non-separator-tokens'
update_non_separator_tokens_1: |-
  curl \
    -X PUT 'http://localhost:7700/indexes/articles/settings/non-separator-tokens' \
    -H 'Content-Type: application/json' \
    --data-binary '["@", "#"]'
reset_non_separator_tokens_1: |-
  curl \
    -X DELETE 'http://localhost:7700/indexes/articles/settings/non-separator-tokens'
get_dictionary_1: |-
  curl \
    -X GET 'http://localhost:7700/indexes/books/settings/dictionary'
23 changes: 20 additions & 3 deletions learn/inner_workings/datatypes.mdx
@@ -12,14 +12,31 @@ String tokenization is the process of **splitting a string into a list of indivi

A string is passed to a tokenizer and is then broken into separate string tokens. A token is a **word**.

- For Latin-based languages, the words are separated by **space**.
- For Kanji characters, the words are separated by **character**.
### Tokenization

For Latin-based languages, there are two kinds of **space separators**:
Tokenization relies on two main processes to identify words and separate them into tokens: separators and dictionaries.

#### Separators

Separators are characters that indicate where one word ends and another word begins. In languages using the Latin alphabet, for example, words are usually delimited by white space. In Japanese, word boundaries are more commonly indicated in other ways, such as appending particles like `に` and `で` to the end of a word.

There are two kinds of separators in Meilisearch: soft and hard. Hard separators signal a significant context switch such as a new sentence or paragraph. Soft separators only delimit one word from another but do not imply a major change of subject.

The list below presents some of the most common separators in languages using the Latin alphabet:

- **Soft spaces** (distance: 1): whitespaces, quotes, `'-' | '_' | '\'' | ':' | '/' | '\\' | '@' | '"' | '+' | '~' | '=' | '^' | '*' | '#'`
- **Hard spaces** (distance: 8): `'.' | ';' | ',' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '|'`

For more separators, including those used in other writing systems like Cyrillic and Thai, [consult this exhaustive list](https://docs.rs/charabia/0.8.3/src/charabia/separators.rs.html#16-62).

#### Dictionaries

For the tokenization process, dictionaries are lists of groups of characters that should each be considered a single term. Dictionaries are particularly useful when identifying words in languages like Japanese, where words are not always marked by separator tokens.

Meilisearch comes with a number of general-use dictionaries for its officially supported languages. When working with documents containing many domain-specific terms, such as legal documents or academic papers, providing a [custom dictionary](/reference/api/settings#dictionary) may improve search result relevancy.
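
The effect of a dictionary entry can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `tokenize` is a hypothetical helper, the `\w+` pattern stands in for the real separator handling, and the sample entry `J. R. R.` is chosen for demonstration. It is not how Meilisearch implements dictionaries internally.

```python
import re


def tokenize(text, dictionary=()):
    """Split text into words, keeping dictionary entries as single terms.

    Hypothetical helper for illustration only, not Meilisearch's tokenizer.
    """
    # Try longer entries first so overlapping entries prefer the longest match.
    entries = sorted(dictionary, key=len, reverse=True)
    # Dictionary entries take priority over the generic "run of word characters".
    alternatives = [re.escape(entry) for entry in entries] + [r"\w+"]
    pattern = re.compile("|".join(alternatives))
    return pattern.findall(text)


# Without a dictionary, the initials are split on "." and white space.
print(tokenize("J. R. R. Tolkien"))
# With a dictionary entry, the initials are kept together as one term.
print(tokenize("J. R. R. Tolkien", ["J. R. R."]))
```

Matching dictionary entries before falling back to generic word splitting is what lets a term containing separators survive as a single token in this sketch.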

### Distance

Distance plays an essential role in determining whether documents are relevant since [one of the ranking rules is the **proximity** rule](/learn/core_concepts/relevancy). The proximity rule sorts the results by increasing distance between matched query terms. As a result, two words separated by a soft space are closer, and thus considered **more relevant**, than two words separated by a hard space.
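
The following Python sketch illustrates how soft and hard separators could translate into word distance, using the distances of 1 and 8 listed above. The helper names `word_positions` and `distance` are hypothetical, and the rule that a run of mixed separators counts as hard is an assumption made for this example. It is a simplified model, not Meilisearch's actual implementation.

```python
# Separator sets copied from the list above; white space is handled separately.
SOFT = set("-_':/\\@\"+~=^*#")
HARD = set(".;,!?()[]{}|")

SOFT_DISTANCE = 1  # distance added by a soft separator
HARD_DISTANCE = 8  # distance added by a hard separator


def word_positions(text):
    """Return (word, position) pairs. Hypothetical helper for illustration."""
    words = []
    position = 0
    current = ""  # word currently being built
    gap = 0       # largest separator distance seen since the previous word
    for char in text + " ":  # the trailing space flushes the last word
        if char in HARD:
            step = HARD_DISTANCE
        elif char in SOFT or char.isspace():
            step = SOFT_DISTANCE
        else:
            step = 0
        if step == 0:
            current += char
            continue
        if current:
            if words:
                position += gap  # the first word always sits at position 0
            words.append((current.lower(), position))
            current = ""
            gap = 0
        # Assumption: a run of separators counts as hard if any of them is hard.
        gap = max(gap, step)
    return words


def distance(text, first, second):
    """Distance between two words that each appear in the text."""
    lookup = dict(word_positions(text))
    return abs(lookup[second] - lookup[first])


print(distance("Hello world", "hello", "world"))   # soft separator
print(distance("Hello. World", "hello", "world"))  # hard separator
```

Under this model, a query for `hello world` would rank the first text above the second on proximity, because the two words are closer together.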

After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.
2 changes: 2 additions & 0 deletions learn/what_is_meilisearch/telemetry.mdx
@@ -197,6 +197,8 @@ This list is liable to change with every new version of Meilisearch. It's not be
| `displayed_attributes.total` | Number of displayed attributes | 3
| `displayed_attributes.with_wildcard` | `true` if `*` is specified as a displayed attribute, otherwise `false` | false
| `stop_words.total` | Number of stop words | 3
| `separator_tokens.total` | Number of separator tokens | 3
| `non_separator_tokens.total` | Number of non-separator tokens | 3
| `dictionary.total` | Number of words in the dictionary | 3
| `synonyms.total` | Number of synonyms | 3
| `per_index_uid` | `true` if the `uid` is used to fetch an index stat resource, otherwise `false` | false
196 changes: 196 additions & 0 deletions reference/api/settings.mdx
@@ -32,6 +32,8 @@ By default, the settings object looks like this. All fields are modifiable.
"exactness"
],
"stopWords": [],
"nonSeparatorTokens": [],
"separatorTokens": [],
"dictionary": [],
"synonyms": {},
"distinctAttribute": null,
@@ -95,6 +97,8 @@ Get the settings of an index.
"exactness"
],
"stopWords": [],
"nonSeparatorTokens": [],
"separatorTokens": [],
"dictionary": [],
"synonyms": {},
"distinctAttribute": null,
@@ -146,6 +150,8 @@ If the provided index does not exist, it will be created.
| **[`pagination`](#pagination)** | Object | [Default object](#pagination-object) | Pagination settings |
| **[`rankingRules`](#ranking-rules)** | Array of strings | `["words",`<br />`"typo",`<br />`"proximity",`<br />`"attribute",`<br />`"sort",`<br />`"exactness"]` | List of ranking rules in order of importance |
| **[`searchableAttributes`](#searchable-attributes)** | Array of strings | All attributes: `["*"]` | Fields in which to search for matching query words sorted by order of importance |
| **[`separatorTokens`](#separator-tokens)** | Array of strings | Empty | List of characters delimiting where one term begins and ends |
| **[`nonSeparatorTokens`](#non-separator-tokens)** | Array of strings | Empty | List of characters not delimiting where one term begins and ends |
| **[`sortableAttributes`](#sortable-attributes)** | Array of strings | Empty | Attributes to use when sorting search results |
| **[`stopWords`](#stop-words)** | Array of strings | Empty | List of words ignored by Meilisearch when present in search queries |
| **[`synonyms`](#synonyms)** | Object | Empty | List of associated words treated similarly |
@@ -1144,6 +1150,196 @@ Reset the searchable attributes of the index to the default value.

You can use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).

## Separator tokens

Configure strings as custom separator tokens indicating where a word ends and begins.

Tokens in the `separatorTokens` list are added on top of [Meilisearch's default list of separators](/learn/advanced/datatypes#string). To remove separators from the default list, use [the `nonSeparatorTokens` setting](#non-separator-tokens).
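
As a rough mental model, the two settings can be thought of as set operations on the default list. The Python sketch below is an illustration only: `DEFAULT_SEPARATORS` is a small excerpt of the default list, the helper names are hypothetical, and the code assumes the two settings do not contain the same token.

```python
import re

# Small excerpt of the default separators, for illustration only.
DEFAULT_SEPARATORS = {" ", "-", "_", "@", "#", ".", ",", "|"}


def effective_separators(separator_tokens, non_separator_tokens):
    """Defaults plus custom separators, minus non-separator tokens.

    Assumes the two lists do not overlap.
    """
    return (DEFAULT_SEPARATORS | set(separator_tokens)) - set(non_separator_tokens)


def split(text, separators):
    """Split text on any of the given separators, dropping empty pieces."""
    # Longer separators first so multi-character tokens win over shorter ones.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(token) for token in ordered)
    return [part for part in re.split(pattern, text) if part]


# "@" is a default separator, so the handle is split in two.
print(split("user@example", effective_separators([], [])))
# Declaring "@" a non-separator token keeps the handle in one piece.
print(split("user@example", effective_separators([], ["@"])))
# Adding "…" as a custom separator splits on it.
print(split("wait…what", effective_separators(["…"], [])))
```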

### Get separator tokens

<RouteHighlighter method="GET" route="/indexes/{index_uid}/settings/separator-tokens" />

Get an index's list of custom separator tokens.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Example

<CodeSamples id="get_separator_tokens_1"/>

##### Response: `200 Ok`

```json
[]
```

### Update separator tokens

<RouteHighlighter method="PUT" route="/indexes/{index_uid}/settings/separator-tokens" />

Update an index's list of custom separator tokens.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Body

```
["|", "…"]
```

An array of strings, with each string indicating a word separator.

#### Example

<CodeSamples id="update_separator_tokens_1"/>

##### Response: `202 Accepted`

```json
{
  "taskUid": 1,
  "indexUid": "articles",
  "status": "enqueued",
  "type": "settingsUpdate",
  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
}
```

Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
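
For readers calling the API from code rather than curl, the request and the follow-up task lookup can be sketched with Python's standard library. This is a minimal sketch under assumptions: `BASE_URL` is the host used in the curl samples, the helper names are hypothetical, no API key is set, and the task URL follows the get-one-task route linked above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7700"  # host taken from the curl samples


def build_update_request(index_uid, setting, tokens):
    """Build (but do not send) a PUT request for a token list setting."""
    url = f"{BASE_URL}/indexes/{index_uid}/settings/{setting}"
    body = json.dumps(tokens).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def task_url(response_body):
    """Return the URL for checking the task described in a 202 response body."""
    task = json.loads(response_body)
    return f"{BASE_URL}/tasks/{task['taskUid']}"


request = build_update_request("articles", "separator-tokens", ["|", "…"])
print(request.get_method(), request.full_url)
print(task_url('{"taskUid": 1, "status": "enqueued"}'))
```

Passing `request` to `urllib.request.urlopen` would send it to a running instance. The sketch stops short of that so it can be read and run without a server.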

### Reset separator tokens

<RouteHighlighter method="DELETE" route="/indexes/{index_uid}/settings/separator-tokens"/>

Reset an index's list of custom separator tokens to its default value, `[]`.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Example

<CodeSamples id="reset_separator_tokens_1"/>

##### Response: `202 Accepted`

```json
{
  "taskUid": 1,
  "indexUid": "articles",
  "status": "enqueued",
  "type": "settingsUpdate",
  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
}
```

Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).

## Non-separator tokens

Remove tokens from Meilisearch's default [list of word separators](/learn/advanced/datatypes#string).

### Get non-separator tokens

<RouteHighlighter method="GET" route="/indexes/{index_uid}/settings/non-separator-tokens" />

Get an index's list of non-separator tokens.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Example

<CodeSamples id="get_non_separator_tokens_1"/>

##### Response: `200 Ok`

```json
[]
```

### Update non-separator tokens

<RouteHighlighter method="PUT" route="/indexes/{index_uid}/settings/non-separator-tokens" />

Update an index's list of non-separator tokens.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Body

```
["@", "#"]
```

An array of strings, with each string indicating a token present in the [list of word separators](/learn/advanced/datatypes#string).

#### Example

<CodeSamples id="update_non_separator_tokens_1"/>

##### Response: `202 Accepted`

```json
{
  "taskUid": 1,
  "indexUid": "articles",
  "status": "enqueued",
  "type": "settingsUpdate",
  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
}
```

Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).

### Reset non-separator tokens

<RouteHighlighter method="DELETE" route="/indexes/{index_uid}/settings/non-separator-tokens"/>

Reset an index's list of non-separator tokens to its default value, `[]`.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Example

<CodeSamples id="reset_non_separator_tokens_1"/>

##### Response: `202 Accepted`

```json
{
  "taskUid": 1,
  "indexUid": "articles",
  "status": "enqueued",
  "type": "settingsUpdate",
  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
}
```

Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).

## Sortable attributes

Attributes that can be used when sorting search results using the [`sort` search parameter](/reference/api/search#sort).