diff --git a/learn/inner_workings/datatypes.mdx b/learn/inner_workings/datatypes.mdx
index 7f1c12119f..5c6598afc6 100644
--- a/learn/inner_workings/datatypes.mdx
+++ b/learn/inner_workings/datatypes.mdx
@@ -12,17 +12,17 @@ String tokenization is the process of **splitting a string into a list of indivi
 
 A string is passed to a tokenizer and is then broken into separate string tokens. A token is a **word**.
 
-- For Latin-based languages, the words are separated by **space**.
-- For Kanji characters, the words are separated by **character**.
+- For Latin-based languages, the words are separated by **space**
+- For Kanji characters, the words are separated by **character**
 
-For Latin-based languages, there are two kinds of **space separators**: soft and hard. Hard separators indicate significant context switch such as a new sentence or paragraph, while soft separators only delimit one word from another.
+For Latin-based languages, there are two kinds of **space separators**: soft and hard. Hard separators indicate significant context switch such as a new sentence or paragraph. Soft separators only delimit one word from another.
 
 The list below presents some of the most common separators in languages using the Latin alphabet:
 
 - **Soft spaces** (distance: 1): whitespaces, quotes, `'-' | '_' | '\'' | ':' | '/' | '\\' | '@' | '"' | '+' | '~' | '=' | '^' | '*' | '#'`
 - **Hard spaces** (distance: 8): `'.' | ';' | ',' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}'| '|'`
 
-For other languages and uncommon Latin alphabet separators, [consult this exhaustive list](https://docs.rs/charabia/0.8.3/src/charabia/separators.rs.html#16-62).
+For more separators, including those used in other writing systems, [consult this exhaustive list](https://docs.rs/charabia/0.8.3/src/charabia/separators.rs.html#16-62).
 
 Distance plays an essential role in determining whether documents are relevant since [one of the ranking rules is the **proximity** rule](/learn/core_concepts/relevancy). The proximity rule sorts the results by increasing distance between matched query terms. Then, two words separated by a soft space are closer and thus considered **more relevant** than two words separated by a hard space.
 
diff --git a/reference/api/settings.mdx b/reference/api/settings.mdx
index 88d6b8b6c7..cb947dd143 100644
--- a/reference/api/settings.mdx
+++ b/reference/api/settings.mdx
@@ -1041,9 +1041,9 @@ You can use this `taskUid` to get more details on [the status of the task](/refe
 
 ## Separator tokens
 
-Configure strings as custom separator tokens delimiting when a word ends and begins.
+Configure strings as custom separator tokens indicating where a word ends and begins.
 
-Tokens in the `separatorTokens` list are added on top of [Meilisearch's default list of separators](/learn/advanced/datatypes#string). To remove separators from the default list, use the `nonSeparatorTokens` setting.
+Tokens in the `separatorTokens` list are added on top of [Meilisearch's default list of separators](/learn/advanced/datatypes#string). To remove separators from the default list, use [the `nonSeparatorTokens` setting](#non-separator-tokens).
 
 ### Get separator tokens
 
@@ -1085,7 +1085,7 @@ Update an index's list of custom separator tokens.
 ["|", "…"]
 ```
 
-An array of strings, with each string indicating a word delimiter.
+An array of strings, with each string indicating a word separator.
 
 #### Example
 
@@ -1176,10 +1176,10 @@ Update an index's list of non-separator tokens.
 #### Body
 
 ```
-["#"]
+["@", "#"]
 ```
 
-An array of strings, with each string indicating a word delimiter present in [list of word separators](/learn/advanced/datatypes#string).
+An array of strings, with each string indicating a token present in [list of word separators](/learn/advanced/datatypes#string).
 
 #### Example
 