Extract text and metadata from documents and import them into Elasticsearch. Quite useful if you want to analyze a big document leak. The toolset uses Apache Tika and Tesseract for text extraction and OCR.
- Clone the repository
git clone https://...
- Install dependencies
npm install
- Run scripts, e.g.
node extract.js ./pdf ./text 'POR'
All scripts are written in JavaScript. To run them, you'll need at least Node.js v6. Check out the Node.js installation guide. The import tools use Elasticsearch 2.4 for document storage and search. For further details, please refer to the Elasticsearch installation guide.
To check if your Elasticsearch is up and running, call the REST interface from the command line:
$ curl -XGET http://localhost:9200/_cluster/health\?pretty\=1
If you are seeing an unassigned shards warning, you might consider setting the number of replicas to 0. This works fine in a development environment:
$ curl -XPUT 'localhost:9200/_settings' -d '
{
  "index": {
    "number_of_replicas": 0
  }
}'
Extract, transform and load: all the scripts in the tool belt and how to use them.
Extracts text from PDF files, using OCR if necessary. Doing OCR on images within a PDF file is quite useful, since many PDF files are scanned documents. The script accepts an ISO language code as the third parameter. Setting the document language will heavily improve OCR quality (defaults to ENG):
$ node extract.js ./pdf ./text 'POR'
The legacy (single-process) extraction mode can be quite slow. To use all CPU cores of your machine to crunch PDFs, run:
$ node extract-multicore.js ./pdf ./text 'POR'
The multi-core implementation is based on the Node.js Cluster API.
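For orientation, here is a minimal sketch of how a list of PDF files could be fanned out over all CPU cores with the Cluster API. This is illustrative only, not the actual implementation in extract-multicore.js; the extraction call is just a placeholder:

const cluster = require('cluster');
const os = require('os');
const fs = require('fs');
const path = require('path');

const inputDir = process.argv[2] || './pdf';

if (cluster.isMaster) {
  // Master process: fork one worker per CPU core and give each a share of the files
  const files = fs.readdirSync(inputDir).filter(file => file.endsWith('.pdf'));
  const cores = os.cpus().length;

  for (let i = 0; i < cores; i++) {
    const worker = cluster.fork();
    worker.send(files.filter((_, index) => index % cores === i));
  }
} else {
  // Worker process: receive its file list and process the files one by one
  process.on('message', files => {
    files.forEach(file => {
      // Placeholder: this is where the Tika/Tesseract extraction would happen
      console.log(`Worker ${process.pid} extracting ${path.join(inputDir, file)}`);
    });
  });
}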
The script uses node-tika as a bridge between Node.js and Tika (Java). However, extracting text from PDFs and images is error-prone. If you encounter problems, you might try using Tika without the Node.js bridge. Just download the Tika JAR and call it from the command line. Example:
$ java -jar tika-app-1.14.jar -t -i ./pdf -o ./text
Note: The memory allocated by Node.js defaults to 512 MB. In some cases this might not be enough and you'll see errors like: FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory. But don't worry, the available memory (in MB) can be temporarily increased: node --max_old_space_size=4096 extract.js
Prepares an Elasticsearch index for import. Be careful: any existing index with the same name is deleted. Usage example:
$ node prepare.js localhost:9200 my-index doc
The analyzer uses ASCII folding to enable searching for terms with diacritics, replacing each diacritical character with its closest ASCII equivalent. So Conceição in the body field becomes Conceicao in the body.folded field. If you don't need ASCII folding, disabling it might save a lot of (database) space.
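Once the index is set up as described below, a query can target both fields, so that a plain ASCII search term also matches the accented original. A minimal sketch using the elasticsearch Node.js client, with the index and type names from the examples in this README:

const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

// Search the original and the ASCII-folded field at once,
// so "Conceicao" also matches documents containing "Conceição"
client.search({
  index: 'my-index',
  type: 'doc',
  body: {
    query: {
      multi_match: {
        query: 'Conceicao',
        fields: ['body', 'body.folded']
      }
    }
  }
}).then(response => {
  console.log(`Found ${response.hits.total} matching documents`);
});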
Setting the analyzer can also be done manually:
$ curl -XPUT localhost:9200/my-index -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}'
Set the mapping:
$ curl -XPUT localhost:9200/my-index/_mapping/doc -d '
{
  "properties": {
    "body": {
      "type": "string",
      "analyzer": "standard",
      "fields": {
        "folded": {
          "type": "string",
          "analyzer": "folding"
        }
      }
    }
  }
}'
Once the Elasticsearch index is prepared, we can start to import the extracted text documents:
$ node import.js ./text localhost:9200 my-index doc
This simple importer saves only the file path and the document body to Elasticsearch. In theory, you could add additional metadata such as date, author or language to allow for advanced filtering and sorting:
body: {
  file: file,
  date: date,
  author: author,
  language: language,
  body: body
}
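As a sketch, indexing a single text file with such an extended body could look like this, using the elasticsearch Node.js client. The metadata values are placeholders; in practice you would extract them from the file name or from the Tika metadata:

const fs = require('fs');
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

const file = 'text/document-vii-2016-01-11.txt';

client.index({
  index: 'my-index',
  type: 'doc',
  body: {
    file: file,
    date: '2016-01-11',   // placeholder, e.g. parsed from the file name
    author: 'unknown',    // placeholder
    language: 'por',      // placeholder
    body: fs.readFileSync(file, 'utf8')
  }
}).then(response => {
  console.log(`Indexed document ${response._id}`);
});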
To check if your documents are all in place, run a simple search query on your index:
$ curl -XGET 'localhost:9200/my-index/_search?q=body:my-query&pretty'
Simple search service with a REST interface. The data from the Elasticsearch cluster can be queried via an API service. There are several ways to make a request:

GET http://localhost:3003/match/:query
Full text search. Finds only exact matches: John Doe (details).

GET http://localhost:3003/custom/:query
Custom full text search. Finds all terms of a query: "John" AND "Doe". Supports wildcards and simple search operators (details).

GET http://localhost:3003/fuzzy/:query
Fuzzy search. Finds all similar terms of a query: Jhon ([details](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html)).

GET http://localhost:3003/regexp/:query
Regular expression support for term-based search: J.hn* (details).
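For orientation, the /match route could be implemented roughly like this, using Express and the elasticsearch Node.js client. This is a sketch, not necessarily identical to server.js; the phrase query and the highlighting are illustrative choices:

const express = require('express');
const elasticsearch = require('elasticsearch');

const app = express();
const client = new elasticsearch.Client({ host: 'localhost:9200' });

// Exact full text search, e.g. GET /match/evil%20company
app.get('/match/:query', (req, res) => {
  client.search({
    index: 'my-index',
    type: 'doc',
    body: {
      query: { match_phrase: { body: req.params.query } },
      highlight: { fields: { body: {} } }
    }
  })
    .then(response => res.json(response))
    .catch(error => res.status(500).json({ error: error.message }));
});

app.listen(3003, () => console.log('Search service listening on port 3003'));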
Run the server:
$ node server.js
Query the service for evil company:
$ curl http://localhost:3003/match/evil%20company
And this is what the response might look like:
{
  "took": 2680,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.40393832,
    "hits": [
      {
        "_index": "my-index",
        "_type": "doc",
        "_id": "AVvJsJpysUfTp_aHjvMj",
        "_score": 0.40393832,
        "_source": {
          "file": "text/document-vii-2016-01-11.txt",
          "name": "Notification Of Disclosure, January 11th, 2016"
        },
        "highlight": {
          "body": [
            "Evidence that <em>Evil Company</em> is doing evil things",
            "More incriminating information about <em>Evil Company</em>"
          ]
        }
      }
    ]
  }
}
Note: The maximum number of results is hard-coded to 100. You can change the limit in the code, e.g. const maxSize = 500. Currently the large document bodies are excluded from the response; instead we get an array of highlighted paragraphs which contain the search term. As before, this can easily be changed in the configuration object.
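To illustrate how these pieces fit together, a search body combining the result limit, the trimmed-down document source and the highlighting might look like this (a sketch; the field selection and the query are assumptions based on the sample response above):

const maxSize = 100;

// Illustrative search body: cap the number of hits, return only the small
// metadata fields instead of the large document body, and request
// <em>-tagged highlight fragments around the matches.
const searchBody = {
  size: maxSize,
  _source: ['file', 'name'],
  query: {
    match_phrase: { body: 'evil company' }
  },
  highlight: {
    fields: {
      body: {}
    }
  }
};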
If you are looking for a web frontend to search your Elasticsearch document collection, have a look at elasticsearch-frontend. The application is built with Express and supports user authentication.
- Move the Elasticsearch database settings (host, port, index) to a ./config file
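A possible shape for such a config file, as a sketch (the path and property names are assumptions):

// ./config/elasticsearch.js (hypothetical)
module.exports = {
  host: 'localhost',
  port: 9200,
  index: 'my-index',
  type: 'doc'
};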