Simple search interface for large document collections in Elasticsearch. Made for the exploration and analysis of big document leaks. The application is built with Express and Pug. User authentication and protected routes are provided by Passport.
The initial prototype was built to uncover the tax haven in the free trade zone of Madeira. We used Elasticsearch to build a document search for the Madeira Gazette. Many of those big PDF files are simple document scans, which we wanted to search for the names of persons and companies. Read the whole story: Madeira – A Tax Haven Approved by the European Commission
Why build another document search engine? Because it is super lightweight and customizable, at least until we add more features.
The application is written in JavaScript. You'll need at least Node.js v6 to run it. Check out the Node.js installation guide. We use Elasticsearch 2.4 for document storage and search. For further details, please refer to the Elasticsearch installation guide.
To check if your Elasticsearch is up and running, call the REST interface from the command line:
$ curl -XGET http://localhost:9200/_cluster/health\?pretty\=1
If you are seeing an unassigned shards warning, consider setting the number of replicas to 0. This works fine in a development environment:
$ curl -XPUT 'localhost:9200/_settings' -d '
{
  "index": {
    "number_of_replicas": 0
  }
}'
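The meaning of the health status can be sketched in a few lines of JavaScript. This is only an illustration: `describeHealth` is a hypothetical helper that is not part of the application, and the sample response uses made-up values. It relies on the `status` and `unassigned_shards` fields of the cluster health response.

```javascript
// Sketch: interpret the JSON returned by GET /_cluster/health.
// `describeHealth` is a hypothetical helper, not part of the application.
function describeHealth(health) {
  if (health.status === 'green') {
    return 'Cluster is healthy: all shards are assigned.';
  }
  if (health.status === 'yellow') {
    // Yellow: every primary shard is assigned, but some replicas are not.
    return `Primary shards are assigned, but ${health.unassigned_shards} replica shard(s) are not. ` +
      'On a single-node cluster, setting number_of_replicas to 0 resolves this.';
  }
  // Red: at least one primary shard is unassigned.
  return 'At least one primary shard is unassigned; some data is unavailable.';
}

// Illustrative response from a single-node development cluster.
const sampleHealth = { status: 'yellow', number_of_nodes: 1, unassigned_shards: 5 };
console.log(describeHealth(sampleHealth));
```

A single node cannot hold a replica of its own primary shards, which is why a fresh development setup typically reports yellow until the replica count is set to 0.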
To check if your documents are all in place, run a simple search query on your index:
$ curl -XGET 'localhost:9200/my-index/_search?q=body:my-query&pretty'
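Queries that contain spaces or special characters need to be URL-encoded before they go into the `q` parameter. The following is a minimal sketch of building such a URL in Node.js; `buildSearchUrl` is a hypothetical helper, and the host, index and field names are the placeholders from the command above.

```javascript
// Hypothetical helper: build a URI search request for a single field.
function buildSearchUrl(host, index, field, query) {
  // Wrap the query in quotes so multi-word input is searched as a phrase,
  // then percent-encode the whole q parameter.
  const q = encodeURIComponent(`${field}:"${query}"`);
  return `http://${host}/${encodeURIComponent(index)}/_search?q=${q}&pretty`;
}

console.log(buildSearchUrl('localhost:9200', 'my-index', 'body', 'John Doe'));
// prints http://localhost:9200/my-index/_search?q=body%3A%22John%20Doe%22&pretty
```

The sketch does not escape quotes inside the query itself, so it is only suitable for quick checks from a script.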
Installation and configuration are straightforward once Elasticsearch is set up.
- Import documents to Elasticsearch: If you have never done that before, there is another repo dedicated to extracting text from PDF files and importing them to Elasticsearch: elasticsearch-import-tools
- Edit the config/config.development.js file.
- Start the server: npm start
- Go to http://localhost:3000. The default username is user and the password is password.
There are four different ways to search for whole sentences (full-text) or a single word (term):
Standard search (full-text search): Finds exact word combinations like John Doe. Diacritics are ignored, so a search for John Doe will also find Jóhñ Döé.
Custom search (full-text search): By default, the custom search finds all documents that contain John AND Doe. Supports wildcards and simple search operators:
- + signifies AND operation
- | signifies OR operation
- - negates a single token
- " wraps a number of tokens to signify a phrase for searching
- * at the end of a term signifies a prefix query
- ~N after a word signifies edit distance (fuzziness)
- ~N after a phrase signifies slop amount
Fuzzy search (term-based search): Finds words, even if they contain a typo or OCR mistake. A search for Jhon or J°hn will also find John.
Regex search (term-based search): Uses Regex patterns like J.h* for searching. This Regex will find words such as John, Jahn and Johnson.
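The four search modes correspond naturally to four kinds of Elasticsearch queries. The sketch below shows one plausible mapping. It is not the application's actual code: the function name `buildQuery`, the type names and the `body` field are assumptions made for illustration.

```javascript
// Sketch only: one plausible mapping from search type to Elasticsearch query DSL.
// The type names and the `body` field are assumptions, not the application's code.
function buildQuery(type, input, field = 'body') {
  switch (type) {
    case 'standard':
      // Full-text: the input is analyzed and must match as a phrase.
      return { match_phrase: { [field]: input } };
    case 'custom':
      // Full-text with operators (+ | - " * ~N); terms are combined with AND.
      return { simple_query_string: { query: input, fields: [field], default_operator: 'and' } };
    case 'fuzzy':
      // Term-based: tolerates a small edit distance between input and term.
      return { fuzzy: { [field]: { value: input, fuzziness: 'AUTO' } } };
    case 'regex':
      // Term-based: the pattern is matched against single indexed terms.
      return { regexp: { [field]: input } };
    default:
      throw new Error(`Unknown search type: ${type}`);
  }
}

console.log(JSON.stringify({ query: buildQuery('custom', 'John +Doe') }, null, 2));
```

Term-based queries are not analyzed, so the input is compared with the terms as they are stored in the index, which are usually lowercased. Elasticsearch also matches a regexp pattern against the whole term, so depending on how a pattern is passed on, it may need a trailing `.*` to match longer words.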
If you want to change the page title and description, update the configuration in config/config.development.js:
config.page = {
  title: 'Document Search',
  description: 'Search Elasticsearch documents for persons, companies and addresses.'
};
The current authentication strategy is username and password, using passport-local. Passport provides many different authentication strategies as Express middleware. If you want to change the authentication method, check out the Passport docs.
For ease of development, valid users are stored in the configuration config/config.development.js:
config.users = [
  {
    id: 1,
    username: 'user',
    password: '$2a$10$vP0qJyEd0hvvpG5MAaHg9ObUJJpJj9HxINZ/Yqz5nPo5Ms2nhR4r.',
    displayName: 'Demo User',
    apiToken: '0b414d8433124406be6500833f1672e5'
  }
];
New password hashes are created using bcrypt:
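const bcrypt = require('bcrypt')
// Cost factor: higher values make hashing slower and brute-forcing harder
const saltRounds = 10
const myPlaintextPassword = 'password'
// Generate a random salt, then hash the password with it
const salt = bcrypt.genSaltSync(saltRounds)
const passwordHash = bcrypt.hashSync(myPlaintextPassword, salt)
// Store this value in the password field of the user
console.log(passwordHash)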
Note that the list of users could easily be stored in a database like MongoDB.
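To illustrate how a configured user list and a password hash fit together at login, here is a minimal sketch of a verify callback in the shape that passport-local expects. The name `makeVerify` and the injected `compare` function are hypothetical; in a real setup `bcrypt.compare` would take the place of `compare`. The demo uses a stand-in comparison so that it runs without any dependencies.

```javascript
// Sketch of a passport-local style verify callback.
// `compare(plain, hash, cb)` is injected; bcrypt.compare has this shape.
// `makeVerify` is a hypothetical name, not part of the application.
function makeVerify(users, compare) {
  return function verify(username, password, done) {
    const user = users.find(u => u.username === username);
    if (!user) return done(null, false); // unknown user
    compare(password, user.password, (err, matches) => {
      if (err) return done(err);
      return done(null, matches ? user : false);
    });
  };
}

// Demo with a stand-in comparison (NOT real hashing, for illustration only).
const demoUsers = [{ id: 1, username: 'user', password: 'hash-of-password' }];
const fakeCompare = (plain, hash, cb) => cb(null, hash === `hash-of-${plain}`);

makeVerify(demoUsers, fakeCompare)('user', 'password', (err, user) => {
  console.log(user ? `Authenticated ${user.username}` : 'Rejected');
});

// Wiring, assuming passport, passport-local and bcrypt are installed:
// passport.use(new LocalStrategy(makeVerify(config.users, bcrypt.compare)));
```

Injecting the comparison function keeps the lookup logic independent of where users are stored, so the array could later be replaced by a database query.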
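The API accepts the apiToken of a user either as a Bearer token in the Authorization header or as an access_token query parameter:
curl -H "Authorization: Bearer 0b414d8433124406be6500833f1672e5" http://127.0.0.1:3000/api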
curl "http://127.0.0.1:3000/api?access_token=0b414d8433124406be6500833f1672e5"
curl -H "Authorization: Bearer 0b414d8433124406be6500833f1672e5" "http://localhost:3000/api/search?query=ciboule&type=match&sorting=date"
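The requests above pass the token in two different places. A minimal sketch of how a server could read the token from either place and look up the matching user is shown below. `extractToken` and `findUserByToken` are hypothetical helpers; the application may well rely on a Passport strategy for this instead.

```javascript
// Hypothetical helper: read an API token from a Node.js request-like object.
function extractToken(req) {
  const header = (req.headers && req.headers.authorization) || '';
  const match = header.match(/^Bearer\s+(\S+)$/i);
  if (match) return match[1];
  // Fall back to the access_token query parameter.
  // The base URL is only needed to parse a relative request path.
  const url = new URL(req.url, 'http://localhost');
  return url.searchParams.get('access_token');
}

// Hypothetical helper: find the configured user that owns a token.
function findUserByToken(users, token) {
  return users.find(u => u.apiToken === token) || null;
}

const apiUsers = [{ id: 1, username: 'user', apiToken: '0b414d8433124406be6500833f1672e5' }];
const request = { url: '/api?access_token=0b414d8433124406be6500833f1672e5', headers: {} };
const apiUser = findUserByToken(apiUsers, extractToken(request));
console.log(apiUser ? apiUser.username : 'unauthorized');
```

Sending the token in the Authorization header is generally preferable, since query strings tend to end up in server logs.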
To deploy the application in a live environment, create a new configuration config/config.production.js. Update it with all your server information, Elasticsearch host, credentials etc.
Use the new configuration by starting node with the environment variable NODE_ENV set to production:
$ NODE_ENV=production node bin/www
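The file naming suggests that the configuration is picked based on the value of `NODE_ENV`. A minimal sketch of that selection follows; `configPathFor` is a hypothetical helper, and the fallback to `development` is an assumption rather than confirmed application behavior.

```javascript
const path = require('path');

// Hypothetical helper: resolve the config file for a given environment.
// Falling back to 'development' is an assumption about the default.
function configPathFor(env) {
  const name = env || 'development';
  // path.posix keeps forward slashes regardless of the operating system.
  return path.posix.join('config', `config.${name}.js`);
}

console.log(configPathFor(process.env.NODE_ENV));
```

With `NODE_ENV=production` set, the helper returns `config/config.production.js`.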
To keep it running, use a process manager like forever or PM2:
$ NODE_ENV=production forever start bin/www
It's advisable to use SSL/TLS encryption for all connections to the server. One way to do this is to route your Node.js application through an Apache or Nginx proxy with HTTPS enabled.
The app uses debug as its core debugging utility. To put the app into debug mode, set the environment variable DEBUG:
export DEBUG=*
If you are on a Windows machine, use:
set DEBUG=*
- Add (inline) document viewer
- Add document import and ingestion
- Add direct API access
- Split data retrieval and rendering
If you are looking for alternatives, check out:
- OCCRP: Aleph, powering the Investigative Dashboard
- ICIJ: Datashare
- EIC: Hoover
- New York Times: Stevedore
- DocumentCloud
- Open Semantic Search
- Overview