3.3 The MapreducePipeline Class
markgoldstein edited this page Oct 24, 2014
The MapreducePipeline class is used to wire together all the steps needed to perform a specific Mapreduce job: it specifies the mapper, the reducer, the input reader, the output writer, and so on. When the job completes, the pipeline returns the filenames produced by the output writer.
class MapreducePipeline(job_name, mapper_spec, reducer_spec, input_reader_spec, output_writer_spec=None, mapper_params=None, reducer_params=None, shards=None)
The constructor's arguments fully specify a Mapreduce job:
- job_name
- The name of the Mapreduce job. This name shows up in the logs and in the UI.
- mapper_spec
- The fully qualified name of the mapper function used in this Mapreduce job. The mapper processes the line-by-line input supplied by the input reader specified in the input_reader_spec param.
- reducer_spec
- The fully qualified name of the reducer function used in this Mapreduce job. The reducer performs work on the grouped mapper output and yields results, using the optional output writer specified in the output_writer_spec param.
- input_reader_spec
- The name of the input reader that supplies input to the mapper for this Mapreduce job.
- output_writer_spec
- The name of the output writer (if any) used to store results from this Mapreduce job.
- mapper_params
- A dictionary of parameters passed to the input reader.
- reducer_params
- A dictionary of parameters passed to the output writer.
- shards
- The number of shards (parallel worker partitions) to use for this Mapreduce job.
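The arguments above can be sketched with a classic word-count job. This is a minimal illustration, not the library's own demo: the module name `main`, the helper `build_pipeline`, and the parameter values are assumptions, while the reader and writer spec strings name classes that ship with the mapreduce library.

```python
# Hedged sketch of wiring a word-count job with MapreducePipeline.
# The dotted spec strings assume the map/reduce functions live in a
# module named "main" (an assumption for illustration).

def word_count_map(data):
    """Mapper: the input reader hands each line to this function.

    The exact shape of `data` depends on the input reader; for
    BlobstoreLineInputReader it is a (byte_offset, line) pair.
    """
    _offset, line = data
    for word in line.split():
        yield (word.lower(), "")

def word_count_reduce(key, values):
    """Reducer: receives one key and every value the mappers emitted for it."""
    yield "%s: %d\n" % (key, len(values))

def build_pipeline(blob_key):
    # Requires the appengine-mapreduce library at run time; imported lazily
    # so the mapper/reducer above stay testable on their own.
    from mapreduce import mapreduce_pipeline
    return mapreduce_pipeline.MapreducePipeline(
        "word_count",                      # job_name, shown in logs and the UI
        mapper_spec="main.word_count_map",
        reducer_spec="main.word_count_reduce",
        input_reader_spec="mapreduce.input_readers.BlobstoreLineInputReader",
        output_writer_spec="mapreduce.output_writers.BlobstoreOutputWriter",
        mapper_params={"blob_keys": blob_key},       # consumed by the input reader
        reducer_params={"mime_type": "text/plain"},  # consumed by the output writer
        shards=16)
```

Note that, as the argument descriptions say, mapper_params are consumed by the input reader and reducer_params by the output writer, not by the map/reduce functions themselves.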
A MapreducePipeline instance has the following method:
start(self, **kwargs)
Starts the Mapreduce job. Keyword arguments (for example, queue_name) are passed through to the underlying Pipeline's start method. (This method is inherited from the Pipeline class.)
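A minimal sketch of launching a constructed pipeline, assuming a helper named run_word_count and a task queue named mapreduce-queue (both assumptions): base_path and pipeline_id are attributes provided by the Pipeline base class, and the /status page is the library's built-in progress UI.

```python
# Hedged sketch: launch a MapreducePipeline and return the URL of the
# built-in status UI. The helper name and queue name are assumptions.

def run_word_count(pipeline):
    # start() is inherited from pipeline.Pipeline; keyword arguments such
    # as queue_name pick the task queue the job's tasks run on.
    pipeline.start(queue_name="mapreduce-queue")
    # The job runs asynchronously; progress can be watched at the status UI.
    return "%s/status?root=%s" % (pipeline.base_path, pipeline.pipeline_id)
```

Because the job is asynchronous, start() returns immediately; the output filenames become available on the pipeline's outputs only after the job finishes.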