pignlproc usage tips

Here are some tips for using the pignlproc tools to mine Wikipedia and DBpedia dumps:

Fetching the data from the official Wikipedia and DBpedia online dumps or from EBS volumes on Amazon EC2.
Splitting a Wikipedia XML dump using Mahout. Small chunks let Pig work in parallel, for instance on files stored in an S3 bucket. They are also useful for testing a script locally on a single small chunk before launching it as a job on a Hadoop cluster.
Running pignlproc scripts on an EC2 Hadoop cluster using Apache Whirr. Whirr makes it possible to set up a Hadoop cluster on Amazon EC2 with a minimalist configuration file, which makes it quite easy to run a cluster of up to 20 nodes.
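
To illustrate the idea behind the splitting tip, here is a minimal Python sketch that groups the pages of a Wikipedia XML dump into small chunks for local testing. It is not the Mahout splitter and does not reproduce its behavior. The function name `split_pages` and the `pages_per_chunk` parameter are hypothetical, and the code assumes that each `<page>` and `</page>` tag sits on its own line (or that a page fits on a single line), which is how the official dumps are usually laid out.

```python
import io
from typing import Iterable, Iterator, List


def split_pages(lines: Iterable[str], pages_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of <page>...</page> blocks, at most pages_per_chunk per list.

    Hypothetical helper for local experiments; lines outside <page> elements
    (the <mediawiki> wrapper, <siteinfo>, ...) are ignored.
    """
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    chunk: List[str] = []
    page: List[str] = []
    in_page = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<page>"):
            # Start collecting a new page.
            in_page = True
            page = []
        if in_page:
            page.append(line)
        if in_page and stripped.endswith("</page>"):
            # The page is complete: store it and emit the chunk when full.
            in_page = False
            chunk.append("".join(page))
            if len(chunk) == pages_per_chunk:
                yield chunk
                chunk = []
    if chunk:
        # Emit the last, possibly smaller, chunk.
        yield chunk


# Tiny in-memory stand-in for a dump file (made-up content).
sample = io.StringIO(
    "<mediawiki>\n"
    "  <page>\n    <title>A</title>\n  </page>\n"
    "  <page>\n    <title>B</title>\n  </page>\n"
    "  <page>\n    <title>C</title>\n  </page>\n"
    "</mediawiki>\n"
)
chunks = list(split_pages(sample, pages_per_chunk=2))
```

Because the function consumes any iterable of lines, the same call should also work on an open file handle, so a large dump can be streamed without loading it fully into memory. Each chunk would still need to be wrapped in a `<mediawiki>` root element before being written out as a standalone XML file.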