Commit

improve wording
guimachiavelli committed Sep 13, 2023
1 parent 1d11da0 commit 5211e57
Showing 2 changed files with 9 additions and 9 deletions.
8 changes: 4 additions & 4 deletions learn/inner_workings/datatypes.mdx
@@ -12,17 +12,17 @@ String tokenization is the process of **splitting a string into a list of indivi

A string is passed to a tokenizer and is then broken into separate string tokens. A token is a **word**.

- - For Latin-based languages, the words are separated by **space**.
- - For Kanji characters, the words are separated by **character**.
+ - For Latin-based languages, the words are separated by **space**
+ - For Kanji characters, the words are separated by **character**

- For Latin-based languages, there are two kinds of **space separators**: soft and hard. Hard separators indicate significant context switch such as a new sentence or paragraph, while soft separators only delimit one word from another.
+ For Latin-based languages, there are two kinds of **space separators**: soft and hard. Hard separators indicate significant context switch such as a new sentence or paragraph. Soft separators only delimit one word from another.

The list below presents some of the most common separators in languages using the Latin alphabet:

- **Soft spaces** (distance: 1): whitespaces, quotes, `'-' | '_' | '\'' | ':' | '/' | '\\' | '@' | '"' | '+' | '~' | '=' | '^' | '*' | '#'`
- **Hard spaces** (distance: 8): `'.' | ';' | ',' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}'| '|'`

- For other languages and uncommon Latin alphabet separators, [consult this exhaustive list](https://docs.rs/charabia/0.8.3/src/charabia/separators.rs.html#16-62).
+ For more separators, including those used in other writing systems, [consult this exhaustive list](https://docs.rs/charabia/0.8.3/src/charabia/separators.rs.html#16-62).

Distance plays an essential role in determining whether documents are relevant since [one of the ranking rules is the **proximity** rule](/learn/core_concepts/relevancy). The proximity rule sorts the results by increasing distance between matched query terms. Then, two words separated by a soft space are closer and thus considered **more relevant** than two words separated by a hard space.
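
The soft and hard distances described in this hunk can be illustrated with a minimal Python sketch. It is not part of the changed files and is not how Meilisearch or charabia implement tokenization. The function names are hypothetical, only the Latin-alphabet separators listed above are handled, and taking the largest distance when several separators sit between two words is an assumption made for the example.

```python
# Separator characters copied from the lists in the documentation above.
SOFT_SEPARATORS = set("-_':/\\@\"+~=^*#")
HARD_SEPARATORS = set(".;,!?()[]{}|")
SOFT_DISTANCE = 1
HARD_DISTANCE = 8


def separator_distance(char):
    """Return the distance a character contributes, or 0 if it belongs to a word."""
    if char in HARD_SEPARATORS:
        return HARD_DISTANCE
    if char.isspace() or char in SOFT_SEPARATORS:
        return SOFT_DISTANCE
    return 0


def tokenize(text):
    """Split text into (word, gap) pairs, where gap is the distance to the previous word.

    Assumption: when several separators sit between two words, the largest
    distance wins. The first word always gets a gap of 0.
    """
    tokens = []
    word = ""
    gap = 0          # largest separator distance seen since the previous word
    pending_gap = 0  # gap recorded for the word currently being built
    for char in text:
        distance = separator_distance(char)
        if distance == 0:
            if not word:
                # A new word starts: remember how far it is from the previous one.
                pending_gap = gap if tokens else 0
                gap = 0
            word += char
        else:
            if word:
                tokens.append((word, pending_gap))
                word = ""
            gap = max(gap, distance)
    if word:
        tokens.append((word, pending_gap))
    return tokens


if __name__ == "__main__":
    print(tokenize("hello world. Goodbye"))
```

In this sketch, `world` ends up at distance 1 from `hello` (soft space) and `Goodbye` at distance 8 from `world` (hard space), which is the difference the proximity rule relies on.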

10 changes: 5 additions & 5 deletions reference/api/settings.mdx
@@ -1041,9 +1041,9 @@ You can use this `taskUid` to get more details on [the status of the task](/refe

## Separator tokens

- Configure strings as custom separator tokens delimiting when a word ends and begins.
+ Configure strings as custom separator tokens indicating where a word ends and begins.

- Tokens in the `separatorTokens` list are added on top of [Meilisearch's default list of separators](/learn/advanced/datatypes#string). To remove separators from the default list, use the `nonSeparatorTokens` setting.
+ Tokens in the `separatorTokens` list are added on top of [Meilisearch's default list of separators](/learn/advanced/datatypes#string). To remove separators from the default list, use [the `nonSeparatorTokens` setting](#non-separator-tokens).
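
How the two settings combine with the default list can be pictured with a small Python sketch. It is not part of the changed files. `DEFAULT_SEPARATORS` is a hypothetical subset of the defaults chosen for illustration, `effective_separators` is an invented helper, and the result for a token listed in both settings is an assumption, since the text above does not specify it.

```python
# Hypothetical subset of the default separators, for illustration only.
DEFAULT_SEPARATORS = frozenset({" ", "-", "_", "@", "#", ".", ",", "!", "?"})


def effective_separators(separator_tokens, non_separator_tokens,
                         defaults=DEFAULT_SEPARATORS):
    """Return the set of separators after applying both settings.

    nonSeparatorTokens removes entries from the default list, and
    separatorTokens is then added on top. Assumption: a token present in
    both lists stays a separator in this sketch.
    """
    return (set(defaults) - set(non_separator_tokens)) | set(separator_tokens)


if __name__ == "__main__":
    # Values taken from the request bodies shown in this diff.
    result = effective_separators(["|", "…"], ["@", "#"])
    print(sorted(result))
```

With the bodies shown in this diff, `|` and `…` become separators while `@` and `#` stop being separators, so a string such as `@user` would be kept as one token under these assumptions.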

### Get separator tokens

@@ -1085,7 +1085,7 @@ Update an index's list of custom separator tokens.
["|", "…"]
```

- An array of strings, with each string indicating a word delimiter.
+ An array of strings, with each string indicating a word separator.

#### Example

@@ -1176,10 +1176,10 @@ Update an index's list of non-separator tokens.
#### Body

```
- ["#"]
+ ["@", "#"]
```

- An array of strings, with each string indicating a word delimiter present in [list of word separators](/learn/advanced/datatypes#string).
+ An array of strings, with each string indicating a token present in [list of word separators](/learn/advanced/datatypes#string).

#### Example
