Olivier Grisel edited this page Jan 6, 2011 · 12 revisions

pignlproc usage tips

Here are some tips for using the pignlproc tools to mine Wikipedia and DBpedia dumps:

  • Fetching the data from the official Wikipedia and DBpedia online dumps or from EBS volumes on Amazon EC2.

  • Splitting a Wikipedia XML dump using Mahout into small chunks lets Pig process the dump in parallel, for instance from an S3 bucket. It is also useful for testing a script locally on a single small chunk before launching it as a job on a Hadoop cluster.

  • Running pignlproc scripts on an EC2 Hadoop cluster using Apache Whirr makes it possible to set up a Hadoop cluster on Amazon EC2 with a minimalist configuration file. This makes it quite easy to run a cluster of up to 20 nodes.
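
To illustrate why splitting the dump helps, here is a minimal Python sketch of the idea: group the `<page>` elements of a dump into fixed-size chunks that can be processed independently. This is not the Mahout splitter and not part of pignlproc. The function names `iter_pages` and `chunk_pages` are hypothetical, and the sketch assumes that `<page>` and `</page>` each sit on their own line, as they usually do in Wikipedia XML dumps.

```python
from typing import Iterable, Iterator, List


def iter_pages(lines: Iterable[str]) -> Iterator[str]:
    """Yield each <page>...</page> element of a dump as one string.

    Assumption: the opening and closing page tags are on their own lines.
    Lines outside of a page (header, siteinfo, footer) are skipped.
    """
    buffer: List[str] = []
    inside = False
    for line in lines:
        if "<page>" in line:
            inside = True
            buffer = []
        if inside:
            buffer.append(line)
        if inside and "</page>" in line:
            inside = False
            yield "".join(buffer)


def chunk_pages(lines: Iterable[str], pages_per_chunk: int) -> Iterator[List[str]]:
    """Group pages into lists of at most pages_per_chunk pages each."""
    chunk: List[str] = []
    for page in iter_pages(lines):
        chunk.append(page)
        if len(chunk) == pages_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        # Emit the last, possibly smaller, chunk.
        yield chunk
```

Each chunk could then be written to its own file, so that a script can be tested locally on one chunk before the full set is processed on a cluster. Passing an open file object as `lines` streams the dump line by line instead of loading it into memory.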