Commit

Add Region support and improve the README
damienalexandre committed Jan 19, 2018
1 parent 33e0bb5 commit 0c9238a
Showing 30 changed files with 7,332 additions and 78 deletions.
91 changes: 22 additions & 69 deletions README.md
@@ -1,16 +1,22 @@
# Emoji synonyms dictionary and custom tokenizer plugin for Elasticsearch
> Add support for emoji in any Lucene compatible search engine!
# Emoji, flags and emoticons support for Elasticsearch

## What is this
> Add support for emoji and flags in any Lucene compatible search engine!
This repository hosts information about Elasticsearch and emoji search:
If you wish to search `🍩` to find **donuts** in your documents, you came to the right place.

- [synonym files](/synonyms) in Solr / Lucene format for emoji search in all languages supported by Unicode CLDR;
- emoticon suggestions for improved meaning extraction;
- full elasticsearch analyzer configuration to copy and paste;
- an [experimental tokenizer plugin](/esplugin) for Elasticsearch (help needed :warning:).
## [The `analysis-emoji` Plugin](/esplugin)

Emoji data are based on the latest [CLDR data set](http://cldr.unicode.org/) (Currently version 30.0.2 stable).
To index emoji, you need a custom tokenizer that does not treat them as punctuation. You can either build an analyzer with the whitespace tokenizer [as described here](http://jolicode.com/blog/search-for-emoji-with-elasticsearch), or **use this plugin**.
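
To see why the tokenizer matters, here is a rough Python analogy (not Elasticsearch code): a word-character regex stands in for a tokenizer that discards symbols, and `str.split` stands in for the whitespace tokenizer. Both stand-ins and the function names are assumptions made for illustration only.

```python
import re


def word_tokens(text):
    # Stand-in for a tokenizer that keeps only word characters;
    # emoji are symbols, so they are dropped just like punctuation.
    return re.findall(r"\w+", text)


def whitespace_tokens(text):
    # Stand-in for the whitespace tokenizer: split on whitespace only.
    return text.split()


print(word_tokens("I love 🍩"))
print(whitespace_tokens("I love 🍩"))
```

The first call loses the donut, the second keeps it as its own token, which is the behaviour the synonym step needs.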

The plugin exposes a new `emoji_tokenizer`, based on `icu_tokenizer` but with custom BreakIterator rules that keep emoji as tokens!

[Head over to the `/esplugin` directory for installation instructions](/esplugin).

## The Synonyms, flags and emoticons

Once you have a `🍩` token, you need to expand it to the token "donut", in **your language**. That's the goal of the [synonym dictionaries](/synonyms).

We build Solr / Lucene compatible synonyms files in all languages supported by [Unicode CLDR](http://cldr.unicode.org/) so you can set them up in an analyzer. It looks like this:

```
👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
@@ -24,15 +30,11 @@ Emoji data are based on the latest [CLDR data set](http://cldr.unicode.org/) (Cu
🇬🇧 => 🇬🇧, united kingdom
```
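
As a rough sketch of how one of these rules reads, the Python below splits a line of the form `emoji => term, term` into its left side and its expansions. `parse_rule` is a hypothetical helper, not part of this repository, and it only handles the single-`=>` shape shown above.

```python
def parse_rule(line):
    # Split "left => a, b, c" into the source token and its expansions.
    left, right = line.split("=>", 1)
    expansions = [term.strip() for term in right.split(",")]
    return left.strip(), expansions


source, expansions = parse_rule("🇬🇧 => 🇬🇧, united kingdom")
print(source, expansions)
```

Because the emoji itself appears on the right-hand side, the original token is kept alongside the words it expands to.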

**Learn more about this in our [blog post describing how to search with emoji in Elasticsearch](http://jolicode.com/blog/search-for-emoji-with-elasticsearch) (2016).**

## Emoji analyzer for Elasticsearch (with the `analysis-emoji` plugin)

Go to the [dedicated plugin documentation](esplugin/README.md).
For emoticons, use [this mapping](emoticons.txt) with a `char_filter` to replace emoticons with emoji.
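
Conceptually, a mapping char filter rewrites the raw text before tokenization. A minimal Python sketch of that idea follows; the two-entry table is a made-up sample (only the `]:)` to `😈` pair comes from this README's own example), and the real entries live in `emoticons.txt`.

```python
import re

# Hypothetical sample table; the real mappings are in emoticons.txt.
EMOTICONS = {"]:)": "😈", ":)": "🙂"}

# Try longer emoticons first so the longer one wins when two
# candidates start at the same position.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(EMOTICONS, key=len, reverse=True))
)


def replace_emoticons(text):
    # Swap every known emoticon for its emoji equivalent.
    return _PATTERN.sub(lambda match: EMOTICONS[match.group(0)], text)


print(replace_emoticons("You are ]:)"))
```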

## Emoji analyzer for Elasticsearch (without the plugin, not perfect)
**Learn more about this in our [blog post describing how to search with emoji in Elasticsearch](http://jolicode.com/blog/search-for-emoji-with-elasticsearch) (2016).**

### Get the files in ./config/analysis/
### Getting started

Download the emoji and emoticon files you want from this repository and store them in `PATH_ES/config/analysis`.

@@ -45,22 +47,14 @@ config
...
```

### Create the analyzer

We call it `english_with_emoji` here because we use the English synonyms:
Use them like this:

```json
PUT /en-emoji
{
"settings": {
"analysis": {
"char_filter": {
"zwj_char_filter": {
"type": "mapping",
"mappings": [
"\\u200D=>"
]
},
"emoticons_char_filter": {
"type": "mapping",
"mappings_path": "analysis/emoticons.txt"
@@ -70,55 +64,14 @@ PUT /en-emoji
"english_emoji": {
"type": "synonym",
"synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
},
"punctuation_and_modifiers_filter": {
"type": "pattern_replace",
"pattern": "\\p{Punct}|\\uFE0E|\\uFE0F|\\uD83C\\uDFFB|\\uD83C\\uDFFC|\\uD83C\\uDFFD|\\uD83C\\uDFFE|\\uD83C\\uDFFF",
"replace": ""
},
"remove_empty_filter": {
"type": "length",
"min": 1
}
},
"analyzer": {
"english_with_emoji": {
"char_filter": ["zwj_char_filter", "emoticons_char_filter"],
"tokenizer": "whitespace",
"filter": [
"lowercase",
"punctuation_and_modifiers_filter",
"remove_empty_filter",
"english_emoji"
]
}
}
}
}
}
```
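
The `zwj_char_filter` and `punctuation_and_modifiers_filter` entries in this configuration strip zero-width joiners, variation selectors and skin-tone modifiers, so that variants of an emoji reduce to their base characters. A minimal Python sketch of that normalization, assuming the same code points as the pattern above (the function name is hypothetical, and the ASCII punctuation part is left out for brevity):

```python
import re

# U+200D zero-width joiner, U+FE0E / U+FE0F variation selectors,
# U+1F3FB..U+1F3FF skin-tone modifiers.
_MODIFIERS = re.compile("[\u200d\ufe0e\ufe0f\U0001F3FB-\U0001F3FF]")


def strip_modifiers(text):
    # Remove joiners, variation selectors and skin tones from the text.
    return _MODIFIERS.sub("", text)


print(strip_modifiers("👍🏽") == "👍")
```

After stripping, a skin-toned thumbs-up matches the same synonym entry as the plain one, and a joined sequence falls apart into its component emoji.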

### Try it!

```json
GET /en-emoji/_analyze?analyzer=english_with_emoji
{
"text": "I love 🍩"
}
# Result: i, love, 🍩, dessert, donut, sweet

GET /en-emoji/_analyze?analyzer=english_with_emoji
{
"text": "You are ]:)"
}
# Result: you, are, 😈, face, fairy, fantasy, horns, smile, tale

GET /en-emoji/_analyze?analyzer=english_with_emoji
{
"text": "Where is 🇫🇮?"
}
# Result: where, is, 🇫🇮, finland
```
[Head over to the `/esplugin` directory for a fully functional mapping](/esplugin).

## How to contribute

@@ -127,9 +80,9 @@ GET /en-emoji/_analyze?analyzer=english_with_emoji
You will need:

- php cli
- svn
- php zip and curl extensions

Edit the tag in `tools/build-beta.php` and run `php tools/build-beta.php`.
Edit the tag in `tools/build-released.php` and run `php tools/build-released.php`.

### Update emoticons

9 changes: 4 additions & 5 deletions esplugin/README.md
@@ -1,4 +1,4 @@
# Elasticsearch analysis-emoji plugin
# Elasticsearch `analysis-emoji` plugin

This plugin creates a new tokenizer called `emoji_tokenizer`, based on `icu_tokenizer` and the latest (59.1) ICU data.

@@ -17,7 +17,7 @@ bin/elasticsearch-plugin install https://github.com/jolicode/emoji-search/releas

## Versions

ICU is always up to date to the latest data in this plugin, so upgrading may require you to re-index your data.
ICU is _always_ up to date to the latest data in this plugin, so upgrading may require you to re-index your data.

analysis-emoji version and ES version | Install URL
-----------|-----------
@@ -38,9 +38,7 @@ analysis-emoji version and ES version | Install URL

## How to use

Build your own analyzer and use the new tokenizer. Look at the main [README](../README.md) for more information.

Download the emoji and emoticon files you want from this repository and store them in `PATH_ES/config/analysis`.
Build your own analyzer and use the new tokenizer. Download the emoji and emoticon files you want from this repository and store them in `PATH_ES/config/analysis`.

```
config
@@ -84,6 +82,7 @@ PUT /en-emoji
}
}
```

Try it:

```json
