separator and non-separator tokens: first draft

meilisearch · Sep 12, 2023 · 1d11da0
1 parent 2f0f1e3
commit 1d11da0
Showing 5 changed files with 231 additions and 1 deletion.
diff --git a/.code-samples.meilisearch.yaml b/.code-samples.meilisearch.yaml
@@ -1120,3 +1120,25 @@ search_parameter_guide_attributes_to_search_on_1: |-
       "q": "adventure",
       "attributesToSearchOn": ["overview"]
     }'
+get_separator_tokens_1: |-
+  curl \
+    -X GET 'http://localhost:7700/indexes/articles/settings/separator-tokens'
+update_separator_tokens_1: |-
+  curl \
+    -X PUT 'http://localhost:7700/indexes/articles/settings/separator-tokens' \
+    -H 'Content-Type: application/json'  \
+    --data-binary '["|", "&hellip;"]'
+reset_separator_tokens_1: |-
+  curl \
+    -X DELETE 'http://localhost:7700/indexes/articles/settings/separator-tokens'
+get_non_separator_tokens_1: |-
+  curl \
+    -X GET 'http://localhost:7700/indexes/articles/settings/non-separator-tokens'
+update_non_separator_tokens_1: |-
+  curl \
+    -X PUT 'http://localhost:7700/indexes/articles/settings/non-separator-tokens' \
+    -H 'Content-Type: application/json'  \
+    --data-binary '["@", "#"]'
+reset_non_separator_tokens_1: |-
+  curl \
+    -X DELETE 'http://localhost:7700/indexes/articles/settings/non-separator-tokens'
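
Review note: the six curl samples above share one URL pattern and differ only in HTTP method and body. As a rough illustration, the same requests could be built with Python's standard library. The helper name `build_settings_request` is hypothetical, and the host and index come from the samples. The requests are only constructed, not sent, so no running instance is needed to try it.

```python
import json
import urllib.request

# Assumed local instance and index, taken from the curl samples above.
BASE_URL = "http://localhost:7700"
INDEX_UID = "articles"


def build_settings_request(setting, method, tokens=None):
    """Build (but do not send) a request for one of the two token settings.

    `setting` is "separator-tokens" or "non-separator-tokens".
    This helper is hypothetical and exists only for illustration.
    """
    url = f"{BASE_URL}/indexes/{INDEX_UID}/settings/{setting}"
    headers = {}
    body = None
    if tokens is not None:
        # PUT sends a JSON array of strings as the request body.
        body = json.dumps(tokens).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method=method)


get_req = build_settings_request("separator-tokens", "GET")
put_req = build_settings_request("separator-tokens", "PUT", ["|", "&hellip;"])
reset_req = build_settings_request("non-separator-tokens", "DELETE")

# Sending requires a running Meilisearch instance, for example:
# with urllib.request.urlopen(put_req) as response:
#     print(json.load(response))
```

A protected instance would presumably also need an authorization header, which the curl samples omit.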
diff --git a/learn/inner_workings/datatypes.mdx b/learn/inner_workings/datatypes.mdx
@@ -15,11 +15,15 @@ A string is passed to a tokenizer and is then broken into separate string tokens
 - For Latin-based languages, the words are separated by **space**.
 - For Kanji characters, the words are separated by **character**.
 
-For Latin-based languages, there are two kinds of **space separators**:
+For Latin-based languages, there are two kinds of **space separators**: soft and hard. Hard separators indicate a significant context switch, such as a new sentence or paragraph, while soft separators only delimit one word from another.
+
+The list below presents some of the most common separators in languages using the Latin alphabet:
 
 - **Soft spaces** (distance: 1): whitespaces, quotes, `'-' | '_' | '\'' | ':' | '/' | '\\' | '@' | '"' | '+' | '~' | '=' | '^' | '*' | '#'`
 - **Hard spaces** (distance: 8): `'.' | ';' | ',' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}'| '|'`
 
+For other languages and uncommon Latin alphabet separators, [consult this exhaustive list](https://docs.rs/charabia/0.8.3/src/charabia/separators.rs.html#16-62).
+
 Distance plays an essential role in determining whether documents are relevant since [one of the ranking rules is the **proximity** rule](/learn/core_concepts/relevancy). The proximity rule sorts the results by increasing distance between matched query terms. Then, two words separated by a soft space are closer and thus considered **more relevant** than two words separated by a hard space.
 
 After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.
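
Review note: the soft/hard distinction in this hunk may be easier to follow with a toy model. The sketch below is a simplification, not Meilisearch's or charabia's implementation. It copies only the example characters from the list above and assumes that a run of separators counts as hard when it contains at least one hard separator. The function names are hypothetical.

```python
# Example separators copied from the list above; the real list is longer.
SOFT = set(" -_':/\\@\"+~=^*#")
HARD = set(".;,!?()[]{}|")
SOFT_DISTANCE = 1
HARD_DISTANCE = 8


def gap_distance(gap):
    """Distance contributed by the run of separators between two adjacent words."""
    # Assumption: a single hard separator makes the whole run hard.
    if any(ch in HARD for ch in gap):
        return HARD_DISTANCE
    return SOFT_DISTANCE


def adjacent_distances(text):
    """Return (left_word, right_word, distance) for each pair of adjacent words."""
    words = []  # words found so far
    gaps = []   # gaps[i] is the separator run that precedes words[i]
    current, gap = "", ""
    for ch in text:
        if ch in SOFT or ch in HARD:
            if current:
                words.append(current)
                current = ""
            gap += ch
        else:
            if not current:
                # A new word starts: record the separators that preceded it.
                gaps.append(gap)
                gap = ""
            current += ch
    if current:
        words.append(current)
    return [
        (words[i], words[i + 1], gap_distance(gaps[i + 1]))
        for i in range(len(words) - 1)
    ]


pairs = adjacent_distances("hello world. goodbye")
# "hello" and "world" are split by a space (soft), "world" and "goodbye" by ". " (hard).
```

Under this model, a query matching `hello world` finds its terms at distance 1, while `world goodbye` finds them at distance 8, which is why the proximity rule would rank the first match higher.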

diff --git a/learn/what_is_meilisearch/telemetry.mdx b/learn/what_is_meilisearch/telemetry.mdx
@@ -197,6 +197,8 @@ This list is liable to change with every new version of Meilisearch. It's not be
 | `displayed_attributes.total`                       | Number of displayed attributes                                                              | 3
 | `displayed_attributes.with_wildcard`               | `true` if `*` is specified as a displayed attribute, otherwise `false`                      | false
 | `stop_words.total`                                 | Number of stop words                                                                        | 3
+| `separator_tokens.total`                           | Number of separator tokens                                                                  | 3
+| `non_separator_tokens.total`                       | Number of non-separator tokens                                                              | 3
 | `synonyms.total`                                   | Number of synonyms                                                                          | 3
 | `per_index_uid`                                    | `true` if the `uid` is used to fetch an index stat resource, otherwise `false`              | false
 | `searches.avg_search_count`                        | The average number of search queries received per call for the aggregated event             | 4.2

diff --git a/reference/api/settings.mdx b/reference/api/settings.mdx
@@ -32,6 +32,8 @@ By default, the settings object looks like this. All fields are modifiable.
     "exactness"
   ],
   "stopWords": [],
+  "nonSeparatorTokens": [],
+  "separatorTokens": [],
   "synonyms": {},
   "distinctAttribute": null,
   "typoTolerance": {
@@ -94,6 +96,8 @@ Get the settings of an index.
     "exactness"
   ],
   "stopWords": [],
+  "nonSeparatorTokens": [],
+  "separatorTokens": [],
   "synonyms": {},
   "distinctAttribute": null,
   "typoTolerance": {
@@ -143,6 +147,8 @@ If the provided index does not exist, it will be created.
 | **[`pagination`](#pagination)**                      | Object           | [Default object](#pagination-object)                                                             | Pagination settings                                                              |
 | **[`rankingRules`](#ranking-rules)**                 | Array of strings | `["words",`<br />`"typo",`<br />`"proximity",`<br />`"attribute",`<br />`"sort",`<br />`"exactness"]` | List of ranking rules in order of importance                                     |
 | **[`searchableAttributes`](#searchable-attributes)** | Array of strings | All attributes: `["*"]`                                                                          | Fields in which to search for matching query words sorted by order of importance |
+| **[`separatorTokens`](#separator-tokens)**           | Array of strings | Empty                                                                                            | List of characters delimiting where one term begins and ends                     |
+| **[`nonSeparatorTokens`](#non-separator-tokens)**    | Array of strings | Empty                                                                                            | List of characters not delimiting where one term begins and ends                 |
 | **[`sortableAttributes`](#sortable-attributes)**     | Array of strings | Empty                                                                                            | Attributes to use when sorting search results                                    |
 | **[`stopWords`](#stop-words)**                       | Array of strings | Empty                                                                                            | List of words ignored by Meilisearch when present in search queries              |
 | **[`synonyms`](#synonyms)**                          | Object           | Empty                                                                                            | List of associated words treated similarly                                       |
@@ -1033,6 +1039,196 @@ Reset the searchable attributes of the index to the default value.
 
 You can use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
 
+## Separator tokens
+
+Configure strings as custom separator tokens indicating where a word ends and begins.
+
+Tokens in the `separatorTokens` list are added on top of [Meilisearch's default list of separators](/learn/advanced/datatypes#string). To remove separators from the default list, use the `nonSeparatorTokens` setting.
+
+### Get separator tokens
+
+<RouteHighlighter method="GET" route="/indexes/{index_uid}/settings/separator-tokens" />
+
+Get an index's list of custom separator tokens.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Example
+
+<CodeSamples id="get_separator_tokens_1"/>
+
+##### Response: `200 Ok`
+
+```json
+[]
+```
+
+### Update separator tokens
+
+<RouteHighlighter method="PUT" route="/indexes/{index_uid}/settings/separator-tokens" />
+
+Update an index's list of custom separator tokens.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Body
+
+```
+["|", "&hellip;"]
+```
+
+An array of strings, with each string indicating a word delimiter.
+
+#### Example
+
+<CodeSamples id="update_separator_tokens_1"/>
+
+##### Response: `202 Accepted`
+
+```json
+{
+  "taskUid": 1,
+  "indexUid": "movies",
+  "status": "enqueued",
+  "type": "settingsUpdate",
+  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
+}
+```
+
+Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
+
+### Reset separator tokens
+
+<RouteHighlighter method="DELETE" route="/indexes/{index_uid}/settings/separator-tokens"/>
+
+Reset an index's list of custom separator tokens to its default value, `[]`.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Example
+
+<CodeSamples id="reset_separator_tokens_1"/>
+
+##### Response: `202 Accepted`
+
+```json
+{
+  "taskUid": 1,
+  "indexUid": "movies",
+  "status": "enqueued",
+  "type": "settingsUpdate",
+  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
+}
+```
+
+Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
+
+## Non-separator tokens
+
+Remove tokens from Meilisearch's default [list of word separators](/learn/advanced/datatypes#string).
+
+### Get non-separator tokens
+
+<RouteHighlighter method="GET" route="/indexes/{index_uid}/settings/non-separator-tokens" />
+
+Get an index's list of non-separator tokens.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Example
+
+<CodeSamples id="get_non_separator_tokens_1"/>
+
+##### Response: `200 Ok`
+
+```json
+[]
+```
+
+### Update non-separator tokens
+
+<RouteHighlighter method="PUT" route="/indexes/{index_uid}/settings/non-separator-tokens" />
+
+Update an index's list of non-separator tokens.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Body
+
+```
+["@", "#"]
+```
+
+An array of strings, with each string indicating a token present in the [list of word separators](/learn/advanced/datatypes#string).
+
+#### Example
+
+<CodeSamples id="update_non_separator_tokens_1"/>
+
+##### Response: `202 Accepted`
+
+```json
+{
+  "taskUid": 1,
+  "indexUid": "movies",
+  "status": "enqueued",
+  "type": "settingsUpdate",
+  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
+}
+```
+
+Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
+
+### Reset non-separator tokens
+
+<RouteHighlighter method="DELETE" route="/indexes/{index_uid}/settings/non-separator-tokens"/>
+
+Reset an index's list of non-separator tokens to its default value, `[]`.
+
+#### Path parameters
+
+| Name              | Type   | Description                                                               |
+| :---------------- | :----- | :------------------------------------------------------------------------ |
+| **`index_uid`** * | String | [`uid`](/learn/core_concepts/indexes#index-uid) of the requested index |
+
+#### Example
+
+<CodeSamples id="reset_non_separator_tokens_1"/>
+
+##### Response: `202 Accepted`
+
+```json
+{
+  "taskUid": 1,
+  "indexUid": "movies",
+  "status": "enqueued",
+  "type": "settingsUpdate",
+  "enqueuedAt": "2021-08-11T09:25:53.000000Z"
+}
+```
+
+Use this `taskUid` to get more details on [the status of the task](/reference/api/tasks#get-one-task).
+
 ## Sortable attributes
 
 Attributes that can be used when sorting search results using the [`sort` search parameter](/reference/api/search#sort).
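
Review note: the two new settings sections describe `separatorTokens` as additions to the default separator list and `nonSeparatorTokens` as removals from it. The sketch below models that relationship under loud assumptions. `DEFAULT_SEPARATORS` is a tiny illustrative subset, both function names are hypothetical, and the case of a token appearing in both settings is not covered by the text, so the ordering chosen here is arbitrary.

```python
# Tiny illustrative subset of the default separators (assumption, not the real list).
DEFAULT_SEPARATORS = {" ", ".", ",", "@", "#", "-"}


def effective_separators(separator_tokens, non_separator_tokens):
    """Defaults minus the opted-out tokens, plus the custom separators."""
    return (DEFAULT_SEPARATORS - set(non_separator_tokens)) | set(separator_tokens)


def split_words(text, separators):
    """Split text on any separator string, trying longer separators first."""
    ordered = sorted(separators, key=len, reverse=True)
    words, current, i = [], "", 0
    while i < len(text):
        match = next((s for s in ordered if text.startswith(s, i)), None)
        if match:
            # A separator ends the current word, if there is one.
            if current:
                words.append(current)
                current = ""
            i += len(match)
        else:
            current += text[i]
            i += 1
    if current:
        words.append(current)
    return words


default = effective_separators([], [])
custom = effective_separators(["|", "&hellip;"], ["@", "#"])

default_words = split_words("#rust @alice a|b", default)
custom_words = split_words("#rust @alice a|b", custom)
```

With the default subset, `#` and `@` are stripped and `a|b` stays one word. With the values from the update examples, `#rust` and `@alice` keep their prefixes and `|` splits `a|b` in two.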

diff --git a/sample-template.yaml b/sample-template.yaml
@@ -170,3 +170,9 @@ facet_search_1: |-
 facet_search_2: |-
 facet_search_3: |-
 search_parameter_guide_attributes_to_search_on_1: |-
+get_separator_tokens_1: |-
+update_separator_tokens_1: |-
+reset_separator_tokens_1: |-
+get_non_separator_tokens_1: |-
+update_non_separator_tokens_1: |-
+reset_non_separator_tokens_1: |-