Pseudo-distributed Hadoop

Hadoop can also run in pseudo-distributed mode. It still runs on a single machine, but each Hadoop daemon runs as a separate Java process, and multiple map and reduce tasks can run concurrently. This lets you take advantage of, say, one large multi-core machine to do extraction faster than in standalone mode. These instructions are adapted from the official Apache quickstart guide.

Configuration

You should set some extra configuration parameters. In $HADOOP/conf/core-site.xml:

    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

In $HADOOP/conf/hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>

In $HADOOP/conf/mapred-site.xml (the mapred.tasktracker.{map,reduce}.tasks.maximum properties set the maximum number of map and reduce tasks that may run concurrently):

    <configuration>
        <property>
            <name>mapred.tasktracker.map.tasks.maximum</name>
            <value>4</value>
        </property>
        <property>
            <name>mapred.tasktracker.reduce.tasks.maximum</name>
            <value>4</value>
        </property>
        <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9001</value>
        </property>
    </configuration>

Set up passphraseless ssh

Hadoop's startup scripts use ssh to launch the daemons, so make sure you can ssh to localhost without being prompted for a passphrase:

    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
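
To confirm it worked, try logging in (a quick sanity check; the first connection may ask you to accept the host key):

    # should log you in without a passphrase prompt
    ssh localhost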

Preparing the "grid"

Format the new filesystem:

    $HADOOP/bin/hadoop namenode -format

Start the daemons:

    $HADOOP/bin/start-all.sh
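
If the daemons came up cleanly, the JDK's jps tool should list all five of them (a quick sanity check; the names below are Hadoop's 0.20-era daemon processes):

    # expect NameNode, DataNode, SecondaryNameNode, JobTracker,
    # and TaskTracker in the output
    jps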

And now you're ready to run some hadoop jobs!

Running

  1. Copy your data to the "distributed" filesystem: hadoop fs -put corpus.unified input. (A worked example of the whole sequence follows this list.)
    1. In your thrax.conf, the hadoop-work-dir key should be set relative to a "home" directory you have on the distributed filesystem. Do not use the default file:// prefix! Leaving this key blank will let thrax use a sensible default: ./thrax_run_YYYY_MM_DD_hhmmss.
    2. The work-dir still refers to the local filesystem; somewhere in /tmp is fine to use.
    3. The input-file works similarly to hadoop-work-dir; it is relative to the distributed filesystem. Since you just copied it with hadoop fs -put, you know exactly where it is sitting. In the example from number 1, you would set this key to input.
  2. Run! It's exactly the same command: $THRAX/thrax <config>
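
Putting the steps together, a typical session might look like this (a sketch only: the corpus name, HDFS paths, and configuration values are illustrative, not required names):

    # copy the corpus onto the "distributed" filesystem
    $HADOOP/bin/hadoop fs -put corpus.unified input

    # in thrax.conf (illustrative values):
    #   input-file       input        (on the distributed filesystem)
    #   hadoop-work-dir  thrax_run    (relative to your HDFS home directory)
    #   work-dir         /tmp/thrax   (on the local filesystem)

    # run the extraction exactly as in standalone mode
    $THRAX/thrax thrax.conf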

Cleanup

You can stop the daemons with $HADOOP/bin/stop-all.sh.
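
Before you do, note that Thrax's output still lives on the distributed filesystem, so copy anything you need back to local disk first. The path below is hypothetical; it depends on your hadoop-work-dir setting:

    # "thrax_run" stands in for whatever hadoop-work-dir you configured
    $HADOOP/bin/hadoop fs -get thrax_run .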