separator and non-separator tokens: first draft
guimachiavelli committed Sep 12, 2023
1 parent 2f0f1e3 commit 1d11da0
Showing 5 changed files with 231 additions and 1 deletion.
22 changes: 22 additions & 0 deletions .code-samples.meilisearch.yaml
@@ -1120,3 +1120,25 @@ search_parameter_guide_attributes_to_search_on_1: |-
"q": "adventure",
"attributesToSearchOn": ["overview"]
}'
get_separator_tokens_1: |-
curl \
-X GET 'http://localhost:7700/indexes/articles/settings/separator-tokens'
update_separator_tokens_1: |-
curl \
-X PUT 'http://localhost:7700/indexes/articles/settings/separator-tokens' \
-H 'Content-Type: application/json' \
--data-binary '["|", "…"]'
reset_separator_tokens_1: |-
curl \
-X DELETE 'http://localhost:7700/indexes/articles/settings/separator-tokens'
get_non_separator_tokens_1: |-
curl \
-X GET 'http://localhost:7700/indexes/articles/settings/non-separator-tokens'
update_non_separator_tokens_1: |-
curl \
-X PUT 'http://localhost:7700/indexes/articles/settings/non-separator-tokens' \
-H 'Content-Type: application/json' \
--data-binary '["@", "#"]'
reset_non_separator_tokens_1: |-
curl \
-X DELETE 'http://localhost:7700/indexes/articles/settings/non-separator-tokens'
6 changes: 5 additions & 1 deletion learn/inner_workings/datatypes.mdx
@@ -15,11 +15,15 @@ A string is passed to a tokenizer and is then broken into separate string tokens
- For Latin-based languages, the words are separated by **space**.
- For Kanji characters, the words are separated by **character**.

For Latin-based languages, there are two kinds of **space separators**:
For Latin-based languages, there are two kinds of **space separators**: soft and hard. Hard separators indicate a significant context switch, such as a new sentence or paragraph, while soft separators only delimit one word from another.

The list below presents some of the most common separators in languages using the Latin alphabet:

- **Soft spaces** (distance: 1): whitespaces, quotes, `'-' | '_' | '\'' | ':' | '/' | '\\' | '@' | '"' | '+' | '~' | '=' | '^' | '*' | '#'`
- **Hard spaces** (distance: 8): `'.' | ';' | ',' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}'| '|'`

For other languages and uncommon Latin alphabet separators, [consult this exhaustive list](https://docs.rs/charabia/0.8.3/src/charabia/separators.rs.html#16-62).

Distance plays an essential role in determining whether documents are relevant since [one of the ranking rules is the **proximity** rule](/learn/core_concepts/relevancy). The proximity rule sorts the results by increasing distance between matched query terms. As a result, two words separated by a soft space are closer and thus considered **more relevant** than two words separated by a hard space.
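
The distances described above can be illustrated with a minimal Python sketch. It is not Meilisearch's tokenizer: the separator sets are small subsets of the lists above, the function name `words_with_gaps` is hypothetical, and the rule that a run of separators counts as hard when it contains at least one hard separator is an assumption made for the example.

```python
# Illustrative subsets of the separators listed above, not the full charabia list.
SOFT_SEPARATORS = set("-_':/\\@\"+~=^*#")
HARD_SEPARATORS = set(".;,!?()[]{}|")


def words_with_gaps(text):
    """Return (word, gap) pairs, where gap is the distance to the previous word.

    A run of separators counts as hard (8) if it contains at least one hard
    separator, and as soft (1) otherwise. This rule is an assumption made
    for the sketch.
    """
    result, word, gap = [], "", 0
    for ch in text + " ":  # the trailing space flushes the last word
        is_hard = ch in HARD_SEPARATORS
        is_soft = ch.isspace() or ch in SOFT_SEPARATORS
        if is_hard or is_soft:
            if word:
                result.append((word, gap))
                word, gap = "", 0
            gap = max(gap, 8 if is_hard else 1)
        else:
            word += ch
    return result


print(words_with_gaps("red-hot chili. Peppers"))
```

In this sketch, `hot` and `chili` end up at distance 1 from the word before them, while `Peppers` ends up at distance 8 from `chili` because a period sits between them.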

After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.
2 changes: 2 additions & 0 deletions learn/what_is_meilisearch/telemetry.mdx
@@ -197,6 +197,8 @@ This list is liable to change with every new version of Meilisearch. It's not be
| `displayed_attributes.total` | Number of displayed attributes | 3
| `displayed_attributes.with_wildcard` | `true` if `*` is specified as a displayed attribute, otherwise `false` | false
| `stop_words.total` | Number of stop words | 3
| `separator_tokens.total` | Number of separator tokens | 3
| `non_separator_tokens.total` | Number of non-separator tokens | 3
| `synonyms.total` | Number of synonyms | 3
| `per_index_uid` | `true` if the `uid` is used to fetch an index stat resource, otherwise `false` | false
| `searches.avg_search_count` | The average number of search queries received per call for the aggregated event | 4.2
196 changes: 196 additions & 0 deletions reference/api/settings.mdx
@@ -32,6 +32,8 @@ By default, the settings object looks like this. All fields are modifiable.
"exactness"
],
"stopWords": [],
"nonSeparatorTokens": [],
"separatorTokens": [],
"synonyms": {},
"distinctAttribute": null,
"typoTolerance": {
@@ -94,6 +96,8 @@ Get the settings of an index.
"exactness"
],
"stopWords": [],
"nonSeparatorTokens": [],
"separatorTokens": [],
"synonyms": {},
"distinctAttribute": null,
"typoTolerance": {
@@ -143,6 +147,8 @@ If the provided index does not exist, it will be created.
| **[`pagination`](#pagination)** | Object | [Default object](#pagination-object) | Pagination settings |
| **[`rankingRules`](#ranking-rules)** | Array of strings | `["words",`<br />`"typo",`<br />`"proximity",`<br />`"attribute",`<br />`"sort",`<br />`"exactness"]` | List of ranking rules in order of importance |
| **[`searchableAttributes`](#searchable-attributes)** | Array of strings | All attributes: `["*"]` | Fields in which to search for matching query words sorted by order of importance |
| **[`separatorTokens`](#separator-tokens)** | Array of strings | Empty | List of characters delimiting where one term begins and ends |
| **[`nonSeparatorTokens`](#non-separator-tokens)** | Array of strings | Empty | List of characters not delimiting where one term begins and ends |
| **[`sortableAttributes`](#sortable-attributes)** | Array of strings | Empty | Attributes to use when sorting search results |
| **[`stopWords`](#stop-words)** | Array of strings | Empty | List of words ignored by Meilisearch when present in search queries |
| **[`synonyms`](#synonyms)** | Object | Empty | List of associated words treated similarly |
@@ -1033,6 +1039,196 @@ Reset the searchable attributes of the index to the default value.

You can use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).

## Separator tokens

Configure strings as custom separator tokens indicating where a word begins and ends.

Tokens in the `separatorTokens` list are added on top of [Meilisearch's default list of separators](/learn/advanced/datatypes#string). To remove separators from the default list, use the `nonSeparatorTokens` setting.
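
As a rough illustration of how custom separators extend the defaults, the Python sketch below splits a string using a default set plus any extra tokens. The function name `split_with_custom_separators` and the tiny `DEFAULT_SEPARATORS` set are hypothetical stand-ins for this example and do not reflect Meilisearch's actual tokenizer.

```python
import re

# Tiny illustrative subset; the real default list is much longer.
DEFAULT_SEPARATORS = {" ", ".", ",", "-"}


def split_with_custom_separators(text, separator_tokens=()):
    """Split text on the default separators plus any custom separator tokens."""
    separators = DEFAULT_SEPARATORS | set(separator_tokens)
    # Try longer tokens first so multi-character separators are not cut short.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(token) for token in ordered)
    return [word for word in re.split(pattern, text) if word]


# Without custom separators the string stays in one piece.
print(split_with_custom_separators("news|sports…weather"))
# With "|" and "…" configured, it is split into three words.
print(split_with_custom_separators("news|sports…weather", ["|", "…"]))
```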

### Get separator tokens

<RouteHighlighter method="GET" route="/indexes/{index_uid}/settings/separator-tokens" />

Get an index's list of custom separator tokens.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Example

<CodeSamples id="get_separator_tokens_1"/>

##### Response: `200 Ok`

```json
[]
```

### Update separator tokens

<RouteHighlighter method="PUT" route="/indexes/{index_uid}/settings/separator-tokens" />

Update an index's list of custom separator tokens.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Body

```json
["|", "&hellip;"]
```

An array of strings, with each string indicating a word delimiter.

#### Example

<CodeSamples id="update_separator_tokens_1"/>

##### Response: `202 Accepted`

```json
{
"taskUid": 1,
"indexUid": "articles",
"status": "enqueued",
"type": "settingsUpdate",
"enqueuedAt": "2021-08-11T09:25:53.000000Z"
}
```

Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
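
For readers calling the API from code rather than curl, the same request might be built with Python's standard library as sketched below. The helper name `build_separator_tokens_update` is hypothetical, the host and index name are taken from the curl sample, and the request is only constructed here: sending it assumes a running Meilisearch instance.

```python
import json
import urllib.request


def build_separator_tokens_update(host, index_uid, tokens):
    """Build (but do not send) a PUT request that updates separator tokens."""
    url = f"{host}/indexes/{index_uid}/settings/separator-tokens"
    body = json.dumps(tokens, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_separator_tokens_update(
    "http://localhost:7700", "articles", ["|", "…"]
)
print(request.get_method(), request.full_url)

# Sending the request requires a running instance:
# with urllib.request.urlopen(request) as response:
#     task = json.load(response)  # expected to contain a taskUid
```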

### Reset separator tokens

<RouteHighlighter method="DELETE" route="/indexes/{index_uid}/settings/separator-tokens"/>

Reset an index's list of custom separator tokens to its default value, `[]`.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Example

<CodeSamples id="reset_separator_tokens_1"/>

##### Response: `202 Accepted`

```json
{
"taskUid": 1,
"indexUid": "articles",
"status": "enqueued",
"type": "settingsUpdate",
"enqueuedAt": "2021-08-11T09:25:53.000000Z"
}
```

Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).

## Non-separator tokens

Remove tokens from Meilisearch's default [list of word separators](/learn/advanced/datatypes#string).
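
The effect of removing a default separator can be sketched in Python as follows. The function name `split_without_non_separators` and the small `DEFAULT_SEPARATORS` set are hypothetical and only approximate the behavior described here; they are not Meilisearch's implementation.

```python
import re

# Tiny illustrative subset of the default separators, including "@" and "#".
DEFAULT_SEPARATORS = {" ", ".", ",", "@", "#"}


def split_without_non_separators(text, non_separator_tokens=()):
    """Split text on the default separators minus any non-separator tokens."""
    separators = DEFAULT_SEPARATORS - set(non_separator_tokens)
    pattern = "|".join(re.escape(token) for token in separators)
    return [word for word in re.split(pattern, text) if word]


# By default, "@" and "#" split words and are dropped.
print(split_without_non_separators("follow @meili #search"))
# Declared as non-separators, they stay attached to the words.
print(split_without_non_separators("follow @meili #search", ["@", "#"]))
```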

### Get non-separator tokens

<RouteHighlighter method="GET" route="/indexes/{index_uid}/settings/non-separator-tokens" />

Get an index's list of non-separator tokens.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Example

<CodeSamples id="get_non_separator_tokens_1"/>

##### Response: `200 Ok`

```json
[]
```

### Update non-separator tokens

<RouteHighlighter method="PUT" route="/indexes/{index_uid}/settings/non-separator-tokens" />

Update an index's list of non-separator tokens.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Body

```json
["#"]
```

An array of strings, with each string indicating a token present in the [list of word separators](/learn/advanced/datatypes#string) that Meilisearch should no longer treat as a word delimiter.

#### Example

<CodeSamples id="update_non_separator_tokens_1"/>

##### Response: `202 Accepted`

```json
{
"taskUid": 1,
"indexUid": "articles",
"status": "enqueued",
"type": "settingsUpdate",
"enqueuedAt": "2021-08-11T09:25:53.000000Z"
}
```

Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).

### Reset non-separator tokens

<RouteHighlighter method="DELETE" route="/indexes/{index_uid}/settings/non-separator-tokens"/>

Reset an index's list of non-separator tokens to its default value, `[]`.

#### Path parameters

| Name | Type | Description |
| :---------------- | :----- | :------------------------------------------------------------------------ |
| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |

#### Example

<CodeSamples id="reset_non_separator_tokens_1"/>

##### Response: `202 Accepted`

```json
{
"taskUid": 1,
"indexUid": "articles",
"status": "enqueued",
"type": "settingsUpdate",
"enqueuedAt": "2021-08-11T09:25:53.000000Z"
}
```

Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).

## Sortable attributes

Attributes that can be used when sorting search results using the [`sort` search parameter](/reference/api/search#sort).
6 changes: 6 additions & 0 deletions sample-template.yaml
@@ -170,3 +170,9 @@ facet_search_1: |-
facet_search_2: |-
facet_search_3: |-
search_parameter_guide_attributes_to_search_on_1: |-
get_separator_tokens_1: |-
update_separator_tokens_1: |-
reset_separator_tokens_1: |-
get_non_separator_tokens_1: |-
update_non_separator_tokens_1: |-
reset_non_separator_tokens_1: |-
