# 1.3 Comparison with Hadoop
While Hadoop and the App Engine MapReduce library are similar in function, there are differences between the implementations, summarized in the table below:
| | Hadoop | App Engine |
|---|---|---|
| Partitioning Data | Hadoop itself partitions the input data into shards (aka splits). The user specifies the number of reducers. The data in each shard is handled in a separate task. Hadoop tasks are often data-bound: the amount of data each task processes is frequently predetermined when the job starts. Large Hadoop jobs tend to have more map and reduce shards than an equivalent job in App Engine. | The number of input and reducer shards is determined by the Input and Output classes that you use. The data for each shard can be handled by multiple tasks, as explained in slicing, above. |
| Controlling Multiple Jobs | The scheduler can be controlled by setting the priority of each job. Higher-priority jobs may preempt lower-priority jobs that started first. | Multiple jobs run concurrently. The amount of parallelism is controlled by the module configuration and task queue settings. |
| Scaling and Persistence | Hadoop clusters tend to be long-lived and may persist when there are no running jobs. | App Engine scales dynamically, and typically does not consume resources when nothing is running. |
| Combiners | Hadoop supports combiners. | App Engine does not support combiners. Similar work can be performed in your `reduce()` method (see the second sketch below the table). |
| Data Storage | Input, output, and intermediate data (between the map and reduce stages) may be stored in HDFS. | Input, output, and intermediate data may be stored in Google Cloud Storage. |
| Fault Tolerance | The task for each split (shard) can fail and be retried independently. | Retry handling exists at both the shard and slice level. |
| Starting Jobs | RPC over HTTP to the Job Tracker. | A method call that can be invoked from any App Engine application (see the first sketch below the table). |
| Controlling Jobs | The Job Tracker maintains state for all running jobs, and risks becoming overwhelmed. | State is distributed across the datastore and task queues. There is no single point of failure. |
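To make the "Starting Jobs" and "Partitioning Data" rows concrete, here is a minimal sketch of launching a job with the Python version of the library. The handler names (`main.word_count_map`, `main.word_count_reduce`) and the bucket name are placeholders, not part of this page; the reader/writer class names and the `shards` parameter follow the library's Python API, but check the current source for the exact parameter names it expects.

```python
# Hypothetical job launch: a plain method call from application code,
# rather than an RPC to a Job Tracker. Handler and bucket names are
# placeholders for illustration.
from mapreduce import mapreduce_pipeline

def start_word_count():
    pipeline = mapreduce_pipeline.MapreducePipeline(
        "word_count",                    # job name
        "main.word_count_map",           # mapper: dotted name of a generator function
        "main.word_count_reduce",        # reducer: dotted name of a generator function
        "mapreduce.input_readers.GoogleCloudStorageInputReader",
        "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
        mapper_params={
            "input_reader": {
                "bucket_name": "my-bucket",   # placeholder bucket
                "objects": ["input/"],        # object name prefixes to read
            },
        },
        reducer_params={
            "output_writer": {
                "bucket_name": "my-bucket",
            },
        },
        shards=16)  # requested shard count; the Input class may adjust it
    pipeline.start()  # enqueues task queue work; no long-lived cluster involved
    return pipeline.pipeline_id
```

Because `pipeline.start()` simply enqueues task queue tasks, any request handler or cron job in the application can kick off a MapReduce; there is no master node to contact.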
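And for the "Combiners" row, a hedged sketch of folding combiner-style aggregation into the reduce function itself. The function name and output format are illustrative, not part of this page:

```python
# Illustrative reducer: with no combiner stage available, the aggregation a
# Hadoop combiner would perform (here, summing counts) happens in reduce().
def word_count_reduce(key, values):
    # `values` holds every value emitted for `key` across all mappers.
    total = sum(int(v) for v in values)
    yield "%s: %d\n" % (key, total)
```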