Elasticsearch Frontend

Simple search interface for large document collections in Elasticsearch. Made for the exploration and analysis of big document leaks. The application is built with Express and Pug. User authentication and protected routes are provided by Passport.

History

The initial prototype was built to uncover the tax haven in the free trade zone of Madeira. We used Elasticsearch to build a document search for the Madeira Gazette. Many of those big PDF files are simple document scans, which we wanted to search for the names of persons and companies. Read the whole story: Madeira – A Tax Haven Approved by the European Commission

Why build another document search engine? Because it is super lightweight and customizable, at least until we add more features.

Requirements

The application is written in JavaScript. You'll need at least Node.js v6 to run it. Check out the Node.js installation guide. We use Elasticsearch 2.4 for document storage and search. For further details, please refer to the Elasticsearch installation guide.

To check if your Elasticsearch instance is up and running, call the REST interface from the command line:

$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=1'

If you see an unassigned shards warning, consider setting the number of replicas to 0. This works fine in a development environment:

$ curl -XPUT 'localhost:9200/_settings' -d '
{
  "index": {
    "number_of_replicas": 0
  }
}'
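
The health check can also be scripted. The following is a minimal sketch that interprets a parsed _cluster/health response. It relies on the status and unassigned_shards fields of that response; the helper name describeHealth and the message texts are hypothetical and not part of this application.

```javascript
// Sketch: interpret a parsed _cluster/health response.
// `describeHealth` is a hypothetical helper, not part of this application.
function describeHealth(health) {
  if (health.status === 'green') {
    return 'ok';
  }
  if (health.status === 'yellow' && health.unassigned_shards > 0) {
    // Typical for a single-node setup: replica shards cannot be allocated.
    return 'unassigned replicas: consider number_of_replicas = 0 in development';
  }
  return 'needs attention: status is ' + health.status;
}

// Example with a hand-written response object (not real cluster output).
console.log(describeHealth({ status: 'yellow', unassigned_shards: 5 }));
```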

To check if your documents are all in place, run a simple search query on your index:

$ curl -XGET 'localhost:9200/my-index/_search?q=body:my-query&pretty'

Installation

Installation and configuration are straightforward once Elasticsearch is set up.

Import documents to Elasticsearch: If you have never done that before, there is another repo dedicated to extracting text from PDF files and importing them to Elasticsearch: elasticsearch-import-tools
Edit the config/config.development.js file.
Start the server: npm start.
Go to http://localhost:3000. The default username is user and the password is password.

Searching

There are four different ways to search for whole sentences (full-text) or a single word (term):

Standard search (full-text search): Finds exact word combinations like John Doe. Diacritics are ignored, so a search for John Doe will also find Jóhñ Döé.
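
To illustrate what ignoring diacritics means, here is a small sketch in plain JavaScript. In Elasticsearch this kind of folding is usually done during text analysis (for example with an asciifolding token filter); whether this application's index is configured that way is an assumption, and foldDiacritics is a hypothetical helper.

```javascript
// Sketch: fold diacritics so that accented and plain spellings compare equal.
// `foldDiacritics` is a hypothetical helper for illustration only.
function foldDiacritics(text) {
  return text
    .normalize('NFD')                 // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, ''); // drop the combining marks
}

console.log(foldDiacritics('Jóhñ Döé') === 'John Doe'); // prints true
```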

Custom search (full-text search): By default, the custom search finds all documents that contain John AND Doe. Supports wildcards and simple search operators:

+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
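
The operators listed above are those of Elasticsearch's simple_query_string query. A minimal sketch of a request body for such a search could look like this. It assumes the document text lives in a field named body (as in the curl example in the Requirements section) and that AND is the default operator; buildCustomQuery is a hypothetical helper, and the query the application actually sends may differ.

```javascript
// Sketch: request body for a simple_query_string search.
// Assumption: the searchable text is stored in the field `body`.
function buildCustomQuery(userInput) {
  return {
    query: {
      simple_query_string: {
        query: userInput,
        fields: ['body'],
        default_operator: 'and' // "John Doe" means John AND Doe
      }
    }
  };
}

// A phrase with slop 2, or any term starting with "Jane".
console.log(JSON.stringify(buildCustomQuery('"John Doe"~2 | Jane*'), null, 2));
```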

Fuzzy search (term-based search): Finds words, even if they contain a typo or OCR mistake. A search for Jhon or J°hn will also find John.
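
Fuzzy matching is based on edit distance: the number of single-character changes needed to turn one term into another. Elasticsearch's fuzzy query counts a swap of two adjacent characters as one edit by default. The sketch below computes such a distance in plain JavaScript to show why both typos are close to John; editDistance is a hypothetical helper and not how Elasticsearch implements the lookup internally.

```javascript
// Sketch: edit distance that counts insertions, deletions, substitutions
// and swaps of adjacent characters as one edit each (optimal string alignment).
function editDistance(a, b) {
  const d = [];
  for (let i = 0; i <= a.length; i++) {
    d[i] = [i]; // distance from a prefix of `a` to the empty string
  }
  for (let j = 1; j <= b.length; j++) {
    d[0][j] = j; // distance from the empty string to a prefix of `b`
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,       // deletion
        d[i][j - 1] + 1,       // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent swap
      }
    }
  }
  return d[a.length][b.length];
}

console.log(editDistance('Jhon', 'John')); // prints 1 (h and o swapped)
console.log(editDistance('J°hn', 'John')); // prints 1 (one substitution)
```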

Regex search (term-based search): Uses regular expressions like J.h.* for searching. Elasticsearch matches a pattern against the whole term, so this pattern will find words such as John, Jahn and Johnson.

Customization

If you want to change the page title and description, update the configuration in config/config.development.js:

config.page = {
  title: 'Document Search',
  description: 'Search Elasticsearch documents for persons, companies and addresses.'
};

Authentication

The current authentication strategy is username and password, using passport-local. Passport provides many different authentication strategies as Express middleware. If you want to change the authentication method, check out the Passport docs.

For ease of development, valid users are stored in the configuration file config/config.development.js:

config.users = [
  {
    id: 1,
    username: 'user',
    password: '$2a$10$vP0qJyEd0hvvpG5MAaHg9ObUJJpJj9HxINZ/Yqz5nPo5Ms2nhR4r.',
    displayName: 'Demo User',
    apiToken: '0b414d8433124406be6500833f1672e5'
  }
];

New password hashes are created using bcrypt:

const bcrypt = require('bcrypt')
const saltRounds = 10
const myPlaintextPassword = 'password'
const salt = bcrypt.genSaltSync(saltRounds)
const passwordHash = bcrypt.hashSync(myPlaintextPassword, salt)
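
At login, the submitted password has to be checked against the stored hash. The sketch below shows how such a lookup might work. The helper findUser is hypothetical, and the comparison function is passed in so that the example runs without dependencies; with bcrypt, bcrypt.compareSync(plaintext, hash) would take that role. How the application wires this into passport-local is not shown here.

```javascript
// Sketch: look up a user and check the submitted password.
// `findUser` is a hypothetical helper; `compare` stands in for bcrypt.compareSync.
function findUser(users, username, password, compare) {
  const user = users.filter(u => u.username === username)[0];
  if (!user || !compare(password, user.password)) {
    return null; // unknown user or wrong password
  }
  return user;
}

// Stand-in comparison for demonstration only; real code compares against a bcrypt hash.
const fakeCompare = (plain, hash) => hash === 'hashed:' + plain;
const users = [{ id: 1, username: 'user', password: 'hashed:password' }];

console.log(findUser(users, 'user', 'password', fakeCompare) !== null); // prints true
console.log(findUser(users, 'user', 'wrong', fakeCompare) !== null);    // prints false
```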

Note that the list of users could easily be stored in a database like MongoDB.

API

curl -H "Authorization: Bearer 0b414d8433124406be6500833f1672e5" http://127.0.0.1:3000/api
curl "http://127.0.0.1:3000/api?access_token=0b414d8433124406be6500833f1672e5"

curl -H "Authorization: Bearer 0b414d8433124406be6500833f1672e5" "http://localhost:3000/api/search?query=ciboule&type=match&sorting=date"
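
The same search request can be prepared from Node.js. This sketch only builds the request options, using the query parameters and bearer token from the curl examples above; buildSearchRequest is a hypothetical helper, and host and port are assumed to match a local development setup.

```javascript
const querystring = require('querystring');

// Sketch: build options for http.get() against the search endpoint.
// `buildSearchRequest` is a hypothetical helper; host and port are assumptions.
function buildSearchRequest(token, params) {
  return {
    host: 'localhost',
    port: 3000,
    path: '/api/search?' + querystring.stringify(params),
    headers: { Authorization: 'Bearer ' + token }
  };
}

const options = buildSearchRequest('0b414d8433124406be6500833f1672e5', {
  query: 'ciboule',
  type: 'match',
  sorting: 'date'
});

console.log(options.path); // prints /api/search?query=ciboule&type=match&sorting=date
```

The resulting object can be passed to http.get(options, callback) from Node's http module.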

Deployment

To deploy the application in a live environment, create a new configuration file, config/config.production.js. Update it with your server information, Elasticsearch host, credentials etc.

Use the new configuration by starting node with the environment variable set to production:

$ NODE_ENV=production node bin/www
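
As a rough illustration of how an environment name can select a configuration file, consider the sketch below. It assumes the file name follows the pattern config/config.<environment>.js with development as the fallback; configFileFor is a hypothetical helper, and the application's actual loading code may look different.

```javascript
// Sketch: map an environment name to a configuration file path.
// `configFileFor` is a hypothetical helper; falling back to development is an assumption.
function configFileFor(env) {
  return 'config/config.' + (env || 'development') + '.js';
}

console.log(configFileFor(process.env.NODE_ENV));
```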

To keep it running, use a process manager like forever or PM2:

$ NODE_ENV=production forever start bin/www

It's advisable to use SSL/TLS encryption for all connections to the server. One way to do this is to route your Node.js application through an Apache or Nginx proxy with HTTPS enabled.

Debugging

The app uses debug as its core debugging utility. To put the app into debug mode, set the environment variable DEBUG:

export DEBUG=*

If you are on a Windows machine, use:

set DEBUG=*

Planned features

Add (inline) document viewer
Add document import and ingestion
Add direct API access
Split data retrieval and rendering

Similar projects:

If you are looking for alternatives, check out:

OCCRP: Aleph, powering the Investigative Dashboard
ICIJ: Datashare
EIC: Hoover
New York Times: Stevedore
DocumentCloud
Open Semantic Search
Overview
