Play with Elasticsearch querying data from UCI Machine Learning Repository

josalmi/es-movies

 
 

Hands on Elasticsearch

Elasticsearch is a distributed full-text search engine built on Apache Lucene. Some well-known uses:

  • Wikipedia search is powered by Elasticsearch.
  • The Guardian joins access log data with social network data using Elasticsearch to give editors an idea of how the public is responding to articles.
  • Stack Overflow full-text search is powered by Elasticsearch. They use the More Like This feature to find similar answers.
  • GitHub uses Elasticsearch to query 130 billion lines of code.

Prerequisites

You need Docker, Python 2.7 with pip or easy_install, and internet access.

  1. Get the code: git clone git@github.com:josalmi/es-movies.git
  2. Fire up Elasticsearch: docker-compose up
  3. Open shell in client container: docker-compose run client /bin/bash
  4. Load data. ./init.sh
  5. Profit

Exercises

We are using the UCI Movies dataset of over 10,000 films. The titles range from the late 1800s to 1999.

Find all the Academy Award winners in the database. In the data, AA stands for winning an Academy Award.

Find the film Elmer Gantry in the raw data. Did it win an Academy Award?
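
As a starting point, here is a minimal sketch of how these two searches could be written as query bodies in Python. The field names `awards` and `title` are assumptions, not taken from the repository; check the real names with curl http://localhost:9200/movies before relying on them.

```python
import json


def match_query(field, text):
    """Build a full-text match query body for a single field."""
    return {"query": {"match": {field: text}}}


# Hypothetical field names; the index in this repo may use different ones.
winners = match_query("awards", "AA")
elmer = match_query("title", "Elmer Gantry")

print(json.dumps(winners))
print(json.dumps(elmer))
```

Each printed body can be sent to the `_search` endpoint of the `movies` index.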

  1. Find all the Academy Award winners, excluding those who were only nominated (AAN).
  2. Try to filter all the movies that contain the word 'Vampire'. How many are there? What's up with the score?
  3. The best films are not in any particular order. Let's see if we can use a function score to order the results after matches have been made. Perhaps the field_value_factor or the decay functions can help us order our movies.
  4. Something isn't right. Let's look at what our index looks like: curl http://localhost:9200/movies. What's the problem?
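
The exercises above can be sketched as query builders in Python. This is one possible reading, not the intended solution: the field names `awards`, `title` and `year` are hypothetical, and the `filter` clause of `bool` assumes Elasticsearch 2.0 or later.

```python
import json


def winners_only(awards_field="awards"):
    """Bool query: require the AA token and reject the AAN token.

    Note: this also drops films that carry both tokens; remove the
    must_not clause if that is not what you want.
    """
    return {
        "bool": {
            "must": [{"match": {awards_field: "AA"}}],
            "must_not": [{"match": {awards_field: "AAN"}}],
        }
    }


def title_filter(word, title_field="title"):
    """Filter-only bool query; filter clauses do not contribute to the score."""
    return {"bool": {"filter": [{"match": {title_field: word}}]}}


def boosted(query, numeric_field="year"):
    """Wrap a query in function_score so a numeric field influences the order."""
    return {
        "function_score": {
            "query": query,
            "field_value_factor": {"field": numeric_field, "missing": 1},
            "boost_mode": "multiply",
        }
    }


body = {"query": boosted(winners_only())}
print(json.dumps(body, indent=2))
```

`field_value_factor` needs a numeric field, which is worth keeping in mind when looking at the index definition.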

Creating an index mapping.

Tuning relevance in Elasticsearch is a dance between the index and the query. Let's add some mappings! To change the mappings, we will create a new index named 1. Some ready-made mappings are provided, but is there something we should change to make the function score work?

  1. ./et create index 1
  2. ./et reindex 0 1
  3. ./et index alias movies 1 0
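
For orientation, here is a sketch of the request bodies that steps like these usually correspond to in the Elasticsearch REST API. What `./et` actually sends is not shown in this README, so the bodies, the meaning of the alias arguments, and the field names `title` and `year` are all assumptions. The mapping uses the typeless syntax of Elasticsearch 7 and later.

```python
import json

# Assumed mapping for index "1": a numeric type lets function_score read the field.
create_index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "year": {"type": "integer"},
        }
    }
}

# Body for POST /_reindex: copy every document from index "0" into index "1".
reindex_body = {"source": {"index": "0"}, "dest": {"index": "1"}}


def swap_alias(alias, new_index, old_index):
    """Body for POST /_aliases: move the alias in a single atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


alias_body = swap_alias("movies", "1", "0")
print(json.dumps(alias_body, indent=2))
```

Because queries go through the `movies` alias, swapping it means clients do not need to know which numbered index is live.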

Find the Academy Award winners in the drama category.
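
One way to combine both conditions is a `bool` query with two clauses. This is a minimal sketch; `awards` and `category` are hypothetical field names, and the value `Drama` may be spelled differently in the data.

```python
import json


def all_of(*clauses):
    """Build a search body whose hits must satisfy every given clause."""
    return {"query": {"bool": {"filter": list(clauses)}}}


drama_winners = all_of(
    {"match": {"awards": "AA"}},       # assumed awards field
    {"match": {"category": "Drama"}},  # assumed category field and value
)
print(json.dumps(drama_winners, indent=2))
```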

It's a long way from V to Vampire

Once you start typing into the typeahead field the experience isn't very satisfying. Let's create a typeahead index.
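
A common approach to typeahead is to index prefixes of every word with an edge n-gram token filter. The sketch below shows what such a filter produces and what the index settings could look like; the names `prefixes` and `typeahead` and the gram sizes are illustrative choices, not taken from the repository.

```python
def edge_ngrams(word, min_gram=1, max_gram=10):
    """Return the prefixes an edge n-gram filter would emit for one token."""
    word = word.lower()
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]


# Hypothetical index settings using the built-in edge_ngram token filter.
typeahead_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "prefixes": {"type": "edge_ngram", "min_gram": 1, "max_gram": 10}
            },
            "analyzer": {
                "typeahead": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "prefixes"],
                }
            },
        }
    }
}

print(edge_ngrams("Vampire"))
# ['v', 'va', 'vam', 'vamp', 'vampi', 'vampir', 'vampire']
```

At search time the field would typically use a plain `search_analyzer`, so that the user's input is not itself split into prefixes.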

Let's add language analyzers into the mix

Language analyzers have inherent weaknesses, so let's index the original field alongside the analyzed one.
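
Keeping both versions of a field is usually done with multi-fields. A minimal sketch, assuming a `title` field, the built-in `english` analyzer, and a sub-field named `original` (all illustrative; the mapping uses Elasticsearch 7+ syntax):

```python
import json

# The main field is stemmed by the english analyzer; the sub-field keeps
# the words as written, using the standard analyzer.
title_mapping = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "english",
            "fields": {
                "original": {"type": "text", "analyzer": "standard"}
            },
        }
    }
}


def search_both(text):
    """Query the stemmed and the original field and add up their scores."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": ["title", "title.original"],
            }
        }
    }


print(json.dumps(search_both("vampires"), indent=2))
```

With `most_fields`, a document that matches on both the stemmed and the unstemmed form scores higher than one that matches only after stemming.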

Bigrams
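
Reading "bigrams" here as word pairs (an assumption; character bigrams are also possible), the matching Elasticsearch building block is the `shingle` token filter. The sketch shows the pairs such a filter would add to the index, plus a hypothetical filter definition named `word_pairs`.

```python
def word_bigrams(text):
    """Return adjacent word pairs, as a two-word shingle filter would."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Hypothetical filter definition for the index settings.
bigram_filter = {
    "word_pairs": {
        "type": "shingle",
        "min_shingle_size": 2,
        "max_shingle_size": 2,
        "output_unigrams": False,
    }
}

print(word_bigrams("Interview with the Vampire"))
# ['interview with', 'with the', 'the vampire']
```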

Exact phrase matching
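
Phrase matching is typically done with the `match_phrase` query, which requires the terms to appear in order. A minimal sketch, with `title` as an assumed field name:

```python
import json


def phrase_query(field, phrase, slop=0):
    """Build a match_phrase body; slop=0 means the words must be adjacent."""
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


exact = phrase_query("title", "Elmer Gantry")
print(json.dumps(exact))
```

Raising `slop` allows that many position moves between the terms, which loosens the match without giving up word order entirely.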

Fuzzy query and minimum should match
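
Both ideas can be expressed as options on a `match` query. The sketch below is illustrative only: `title` is an assumed field name and the `75%` threshold is an arbitrary example value.

```python
import json


def fuzzy_match(field, text, min_match="75%"):
    """Match query that tolerates typos and requires a share of the terms."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": "AUTO",  # edit distance chosen from term length
                    "minimum_should_match": min_match,
                }
            }
        }
    }


typo_search = fuzzy_match("title", "intervew with the vampyre")
print(json.dumps(typo_search, indent=2))
```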
