# 1.3 Comparison with Hadoop
While Hadoop and the App Engine MapReduce library are similar in function, there are differences between the implementations, summarized in the table below:
| | Hadoop | App Engine |
|---|---|---|
| Partitioning Data | Hadoop itself partitions the input data into shards (aka splits). The user specifies the number of reducers. The data in each shard is handled in a separate task. Hadoop tasks are often data-bound: the amount of data each task processes is frequently predetermined when the job starts. Large Hadoop jobs tend to have more map and reduce shards than an equivalent job in App Engine. | The number of input and reducer shards is determined by the Input and Output classes that you use. The data for each shard can be handled by multiple tasks, as explained in slicing, above. |
| Controlling Multiple Jobs | The scheduler can be controlled by setting the priority of each job. Higher-priority jobs may preempt lower-priority jobs that started first. | Multiple jobs run concurrently. The amount of parallelism is controlled by the module configuration and task queue settings. |
| Scaling and Persistence | Hadoop clusters tend to be long-lived and may persist when there are no running jobs. | App Engine scales dynamically, and typically does not consume resources when nothing is running. |
| Combiners | Hadoop supports combiners. | App Engine does not support combiners. Similar work can be performed in your `reduce()` method (see the second sketch below the table). |
| Data Storage | Input, output, and intermediate data (between the map and reduce stages) may be stored in HDFS. | Input, output, and intermediate data may be stored in Google Cloud Storage. |
| Fault Tolerance | The task for each split (shard) can fail and be retried independently. | Retry handling exists at both the shard and slice level. |
| Starting Jobs | RPC over HTTP to the Job Tracker. | A method call that can be invoked from any App Engine application (see the first sketch below the table). |
| Controlling Jobs | The Job Tracker maintains state for all running jobs, and risks becoming overwhelmed. | State is distributed across the datastore and task queues. There is no single point of failure. |
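To make the "Starting Jobs" and "Partitioning Data" rows concrete, here is a minimal sketch of launching a job with the Python version of the library. The handler names (`main.word_count_map`, `main.word_count_reduce`) and the bucket name are placeholders, not part of this page; the reader/writer class names and the `shards` parameter follow the library's Python API, but check the current source for the exact parameter names it expects.

```python
# Hypothetical job launch: a plain method call from application code,
# rather than an RPC to a Job Tracker. Handler and bucket names are
# placeholders for illustration.
from mapreduce import mapreduce_pipeline

def start_word_count():
    pipeline = mapreduce_pipeline.MapreducePipeline(
        "word_count",                    # job name
        "main.word_count_map",           # mapper: dotted name of a generator function
        "main.word_count_reduce",        # reducer: dotted name of a generator function
        "mapreduce.input_readers.GoogleCloudStorageInputReader",
        "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
        mapper_params={
            "input_reader": {
                "bucket_name": "my-bucket",   # placeholder bucket
                "objects": ["input/"],        # object name prefixes to read
            },
        },
        reducer_params={
            "output_writer": {
                "bucket_name": "my-bucket",
            },
        },
        shards=16)  # requested shard count; the Input class may adjust it
    pipeline.start()  # enqueues task queue work; no long-lived cluster involved
    return pipeline.pipeline_id
```

Because `pipeline.start()` simply enqueues task queue tasks, any request handler or cron job in the application can kick off a MapReduce; there is no master node to contact.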
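And for the "Combiners" row, a hedged sketch of folding combiner-style aggregation into the reduce function itself. The function name and output format are illustrative, not part of this page:

```python
# Illustrative reducer: with no combiner stage available, the aggregation a
# Hadoop combiner would perform (here, summing counts) happens in reduce().
def word_count_reduce(key, values):
    # `values` holds every value emitted for `key` across all mappers.
    total = sum(int(v) for v in values)
    yield "%s: %d\n" % (key, total)
```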