Home
Welcome to the Extract wiki! This is where we document workflows, guides and details about the inner workings of a piece of software that is essential in helping ICIJ fulfil its mission.
Extract is a cross-platform command line tool for parallel, distributed content extraction. Built on top of Apache Tika, it can extract text and metadata from a wide range of formats.
Extract streams the output from Tika instead of buffering it all into memory before writing. This allows it to operate on very large files without memory issues.
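The difference between streaming and buffering can be illustrated with a minimal sketch. This is not Extract's own code (Extract is a Java tool built on Tika); it is a generic Python illustration, and the names `stream_copy` and `CHUNK_SIZE` are hypothetical. It assumes the extracted text is available as a file-like reader and copies it to a writer one chunk at a time, so memory use stays bounded by the chunk size rather than by the size of the document.

```python
import io
from typing import TextIO

CHUNK_SIZE = 8192  # characters per read; an arbitrary illustrative value


def stream_copy(source: TextIO, sink: TextIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy text from source to sink in fixed-size chunks.

    Returns the total number of characters written. Only one chunk is
    held by this function at any time, however large the input is.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty string signals end of input
            break
        sink.write(chunk)
        total += len(chunk)
    return total
```

A buffering approach would instead call `source.read()` once and hold the entire text in memory before writing, which is what becomes a problem for very large files.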
It supports Redis-backed queueing for distributed extraction and will write to a Solr endpoint, plain text files or standard output.
At ICIJ, we used Extract to pull text and metadata from over 12 million files leaked from Mossack Fonseca. For more about this, see Wrangling 2.6TB of data: The people and the technology behind the Panama Papers.
In short, we ran an array of 35 g2.2xlarge EC2 machines, each spewing text and metadata at a Solr core over HTTP. All this was done using the open source distribution you see here.
We ran a customised Blacklight frontend in front of Solr, allowing over 400 journalists simultaneous access to the corpus.