Pseudodistributed hadoop
Hadoop can also run in pseudo-distributed mode. It still runs on one machine, but each Hadoop daemon runs in its own Java process, and several map and reduce tasks can run at the same time. This lets you take advantage of, say, one large multi-core machine to do extraction faster than in standalone mode. These instructions are adapted from the official Apache Hadoop quickstart guide.
You should set some extra configuration parameters. In $HADOOP/conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
In $HADOOP/conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
In $HADOOP/conf/mapred-site.xml (mapred.tasktracker.{map,reduce}.tasks.maximum set the maximum number of concurrent map and reduce tasks):
<configuration>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
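A misspelled property name is silently ignored by Hadoop, so it can help to read the values back before going further. The following is a minimal sketch, not part of Hadoop: `hadoop_conf_get` is a hypothetical helper that prints the value configured for a property name. It assumes the simple layout used in the snippets above (a `<name>` tag followed by its `<value>` tag) and is not a general XML parser.

```shell
# Print the <value> that follows <name>KEY</name> in a Hadoop config file.
# Usage: hadoop_conf_get FILE KEY
# Prints nothing if KEY is not present under exactly that name.
hadoop_conf_get() {
    awk -v key="$2" '
        # Remember that we have seen the requested property name.
        index($0, "<name>" key "</name>") { found = 1 }
        # The first <value> at or after that point belongs to it.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

For example, `hadoop_conf_get "$HADOOP/conf/hdfs-site.xml" dfs.replication` should print the replication factor set above, and an empty result suggests the property name in the file is wrong or missing.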
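Hadoop's start scripts use ssh to launch the daemons, so set up passphraseless ssh to localhost if you do not already have it:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa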
cat ~/.ssh/id_dsa.pub >>~/.ssh/authorized_keys
Format the new filesystem:
$HADOOP/bin/hadoop namenode -format
Start the daemons:
$HADOOP/bin/start-all.sh
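To confirm that the daemons actually came up, you can look at the output of the JDK's `jps` tool. The sketch below assumes a Hadoop 1.x-style install like the one configured above, where start-all.sh launches NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker. `missing_daemons` is a hypothetical helper name, not a Hadoop command.

```shell
# Report which of the expected daemons are absent from jps output.
# Reads jps-style lines ("<pid> <ClassName>") on stdin and prints one
# missing daemon name per line; prints nothing if all five are running.
missing_daemons() {
    running=$(awk '{ print $2 }')
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # -x matches the whole line, so NameNode does not match SecondaryNameNode.
        printf '%s\n' "$running" | grep -qx "$d" || printf '%s\n' "$d"
    done
}
```

Run it as `jps | missing_daemons`. If a name is printed, the log files under the Hadoop logs directory are the usual place to look for the reason.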
And now you're ready to run some hadoop jobs!
1. Copy your data to the "distributed" filesystem:
   hadoop fs -put corpus.unified input
2. In your thrax.conf:
   - The hadoop-work-dir key should be set relative to a "home" directory you have on the distributed filesystem. Do not use the default file:// prefix! Leaving this key blank will let thrax use a sensible default: ./thrax_run_YYYY_MM_DD_hhmmss.
   - The work-dir key still refers to the local filesystem; somewhere in /tmp is fine to use.
   - The input-file key works similarly to hadoop-work-dir; it is relative to the distributed filesystem. Since you just copied it with hadoop fs -put, you know exactly where it is sitting. In the example from number 1, you would set this key to input.
3. Run! It's exactly the same command:
   $THRAX/thrax <config>
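A common mistake is to launch Thrax while input-file points at a path that exists only on the local disk. As a rough sketch, the wrapper below checks for the path on the distributed filesystem with `hadoop fs -test -e` before running Thrax. `run_thrax` and the `HADOOP_CMD` override are hypothetical names introduced for illustration, and the input path is passed by hand rather than parsed from thrax.conf.

```shell
# Run Thrax only if the input path exists on the distributed filesystem.
# Usage: run_thrax CONFIG HDFS_INPUT_PATH
# HADOOP_CMD (hypothetical) overrides the hadoop binary; defaults to "hadoop".
run_thrax() {
    conf=$1
    input=$2
    # "hadoop fs -test -e PATH" exits 0 when PATH exists.
    if ! "${HADOOP_CMD:-hadoop}" fs -test -e "$input"; then
        echo "input '$input' not found on the distributed filesystem" >&2
        return 1
    fi
    "$THRAX/thrax" "$conf"
}
```

With the example above, this would be called as `run_thrax thrax.conf input`.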
You can stop the daemons with $HADOOP/bin/stop-all.sh.