This is a multi-core/shared-memory implementation of the well-known MapReduce programming model. MapReduce was popularized in the 2000s by a paper written by engineers at Google; the same paper inspired Hadoop, the widely used open-source framework for distributed processing of large datasets. The paper introduced a novel approach to processing very large amounts of data (on the order of petabytes) in a scalable and resilient manner.
This project implements the MapReduce programming model in the context of multi-core computers.
This particular implementation goes through the following phases:
- Loading/Splitting data phase: This is the initial phase, where the data are loaded into memory. It goes through the following two steps:
  - Reading the data from a specified source. The source of the data can be one of the following:
    - A file: use the class `FileDataSource.java`.
    - If your data are not in a file, implement the interface `DataSource.java` to retrieve the data from wherever you need.
  - After the data are loaded, the specific implementation of `DataSource.java` splits the input into as many pieces as there are available cores/threads on the machine.
- Mapping phase: The developer provides an implementation of the interface `MapperFunc`, which takes a piece of the data from the splitting phase (a key and a value) and maps it to an intermediate key/value pair. Each piece of the data from the splitting phase is processed in parallel.
- Shuffling phase: As tasks from the mapping phase complete, the shuffling phase starts. It is responsible for grouping each intermediate key/value pair `<InterKey, InterVal>` (the output of the mapping phase) by key, associating each key with all of its `InterVal` values and producing a map of the form `<InterKey, list(InterVal)>`. The shuffling phase does not finish until the last mapping task finishes, and the computation does not proceed until all shuffling tasks are completed.
- Reducing phase: This is the last phase of the computation. The developer provides an implementation of the interface `ReducerFunc.java`, which takes as input an intermediate key `InterKey` and its list of intermediate `InterVal` values, and produces as output a tuple of the form `<OutKey, OutVal>`. The previous shuffling phase groups `list<InterKey, InterVal>` (the output of the mapping phase) into `list<InterKey, list<InterVal>>`. The class `MapReduce.java` makes sure that each task of this phase takes approximately the same number of elements from `list<InterKey, list<InterVal>>` to process. Each task of this phase is carried out in parallel. After all tasks of this phase are completed, the output tuples `<OutKey, OutVal>` are grouped into `list(<OutKey, OutVal>)` and returned to the caller of the `compute` method of the class `MapReduce.java`.
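To make the splitting step concrete, here is a minimal sketch (not taken from the project's source; the class and method names are made up for illustration) of dividing the loaded records into roughly equal chunks, one per available core:

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Divide `data` into `parts` chunks whose sizes differ by at most one.
    // A real DataSource implementation would use
    // Runtime.getRuntime().availableProcessors() for `parts`.
    static <T> List<List<T>> split(List<T> data, int parts) {
        List<List<T>> chunks = new ArrayList<>();
        int base = data.size() / parts;   // minimum chunk size
        int extra = data.size() % parts;  // the first `extra` chunks get one more
        int start = 0;
        for (int i = 0; i < parts; i++) {
            int size = base + (i < extra ? 1 : 0);
            chunks.add(new ArrayList<>(data.subList(start, start + size)));
            start += size;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) data.add(i);
        System.out.println(split(data, 4)); // [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
    }
}
```

Even splits matter because the slowest chunk determines when the mapping phase finishes.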
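The grouping performed by the shuffling phase can be sketched with Java streams. This illustrates the semantics (`<InterKey, InterVal>` pairs collapsed into `<InterKey, list(InterVal)>`), not the project's actual implementation:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;

class ShuffleSketch {
    // Group a flat list of <InterKey, InterVal> pairs by key,
    // producing Map<InterKey, List<InterVal>>.
    static <K, V> Map<K, List<V>> shuffle(List<Entry<K, V>> pairs) {
        return pairs.stream().collect(
                Collectors.groupingBy(
                        Entry::getKey,
                        Collectors.mapping(Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        // Example mapper output for a word count: each word mapped to 1.
        List<Entry<String, Integer>> mapped = List.of(
                new SimpleEntry<>("the", 1),
                new SimpleEntry<>("cat", 1),
                new SimpleEntry<>("the", 1));
        System.out.println(shuffle(mapped).get("the")); // [1, 1]
    }
}
```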
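The reducing phase can be sketched the same way. The `ReducerFunc` shape below is an assumption for illustration (the project's real interface may differ); it shows one reduce call per `<InterKey, list(InterVal)>` entry, run in parallel, with the `<OutKey, OutVal>` tuples collected into a list:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.stream.Collectors;

class ReduceSketch {
    // Assumed signature: reduce one intermediate key and all its values
    // to a single <OutKey, OutVal> tuple.
    interface ReducerFunc<IK, IV, OK, OV> {
        Entry<OK, OV> reduce(IK key, List<IV> values);
    }

    // Apply the reducer to every grouped entry in parallel.
    static <IK, IV, OK, OV> List<Entry<OK, OV>> reduceAll(
            Map<IK, List<IV>> grouped, ReducerFunc<IK, IV, OK, OV> fn) {
        return grouped.entrySet().parallelStream()
                .map(e -> fn.reduce(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example shuffling-phase output for a word count.
        Map<String, List<Integer>> grouped =
                Map.of("the", List.of(1, 1), "cat", List.of(1));
        ReducerFunc<String, Integer, String, Integer> sumCounts =
                (word, counts) -> new SimpleEntry<>(word,
                        counts.stream().mapToInt(Integer::intValue).sum());
        List<Entry<String, Integer>> out = reduceAll(grouped, sumCounts);
        // Sort for deterministic display; parallel reduction order is unspecified.
        System.out.println(new TreeMap<>(out.stream()
                .collect(Collectors.toMap(Entry::getKey, Entry::getValue)))); // {cat=1, the=2}
    }
}
```

Because each key is reduced independently, the per-task work can be balanced by handing each task a similar number of keys, as described for `MapReduce.java` above.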