BASIC UNDERSTANDING OF MAPREDUCE & PROCEDURE
MapReduce is a framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed ones.
Inputs and Outputs:
The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
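The point that input and output pairs can have different types can be sketched in plain Java (this is not the Hadoop API; the class and method names here are illustrative only):

```java
import java.util.*;

// Plain-Java sketch of the <key, value> flow: the job's input pairs and
// the intermediate pairs emitted by the map may use different types.
public class PairTypes {
    // Input pair: <byte offset in the file (Long), line of text (String)>.
    // Emitted pair: <word (String), count (Integer)>.
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            out.add(Map.entry(token, 1)); // emit <word, 1> (offset unused in this sketch)
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "A B R")); // [A=1, B=1, R=1]
    }
}
```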
There are 5 stages in MapReduce:
- INPUT
- SPLIT
- MAPPING
- SORTING AND SHUFFLING
- REDUCE PHASE
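The five stages can be traced end-to-end in plain Java on this post's example input. This is only a single-process sketch of what the framework distributes across a cluster; the split size of 3 tokens is an assumption made to match the example:

```java
import java.util.*;

// End-to-end sketch of the five MapReduce stages (single process, no Hadoop).
public class FiveStages {
    static Map<String, Integer> run(String input) {
        // SPLIT: divide the input into splits of 3 tokens each (assumed size).
        String[] tokens = input.trim().split("\\s+");
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i += 3)
            splits.add(Arrays.asList(tokens).subList(i, Math.min(i + 3, tokens.length)));
        // MAPPING: each split's map task emits a <word, 1> pair per token.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (List<String> split : splits)
            for (String word : split)
                mapped.add(Map.entry(word, 1));
        // SORTING AND SHUFFLING: group the 1s by word, sorted on the key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped)
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        // REDUCE PHASE: sum each word's list of 1s.
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, ones) ->
            result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run("A B R C C R A C B")); // {A=2, B=2, C=3, R=2}
    }
}
```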
Let us try to understand the two tasks, Map & Reduce, with the help of a small example.
The WordCount application is quite straightforward. The Mapper implementation, via its map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via StringTokenizer, and emits a key-value pair of < <word>, 1> for each token.
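The map step described above can be shown in isolation as plain Java. This is a sketch of what the Hadoop Mapper's map() does, without the Hadoop types (Text, IntWritable, Context), using the same StringTokenizer the text names:

```java
import java.util.*;

// The map step in isolation: tokenize one line of input with
// StringTokenizer and emit a <word, 1> pair for every token.
public class MapStep {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line); // splits on whitespace
        while (itr.hasMoreTokens()) {
            emitted.add(Map.entry(itr.nextToken(), 1)); // emit <word, 1>
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(map("A B R")); // [A=1, B=1, R=1]
    }
}
```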
For the above example, the input is the single line (INPUT):
A B R C C R A C B
The framework divides it into three splits, one per map task (SPLIT):
A B R
C C R
A C B
We'll learn more later about the number of maps spawned for a given job, and how to control them in a fine-grained manner. WordCount also specifies a combiner. Hence, the output of each map is passed through the local combiner for local aggregation, after being sorted on the keys.
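The combiner's job can be sketched in plain Java: take the raw pairs one map task emitted, sort them on the key, and pre-sum equal keys on the map side before anything is sent over the network to the reducers (again a sketch, not the Hadoop Combiner API):

```java
import java.util.*;

// Sketch of the local combiner: one map task's raw output is sorted on
// the key, and equal keys are summed locally before the shuffle.
public class CombineStep {
    static SortedMap<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, Integer> combined = new TreeMap<>(); // sorted on the keys
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum); // local aggregation
        }
        return combined;
    }

    public static void main(String[] args) {
        // One map's raw output for the split "C C R":
        List<Map.Entry<String, Integer>> raw =
            List.of(Map.entry("C", 1), Map.entry("C", 1), Map.entry("R", 1));
        System.out.println(combine(raw)); // {C=2, R=1}
    }
}
```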
The output of the first map (split A B R), after sorting and local aggregation (MAPPING & SHUFFLE):
< A, 1>
< B, 1>
< R, 1>
The output of the second map (split C C R):
< C, 2>
< R, 1>
The output of the third map (split A C B):
< A, 1>
< B, 1>
< C, 1>
The Reducer implementation, via its reduce method, just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
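The summing described here can be sketched as a plain-Java reduce function (a stand-in for the Hadoop Reducer's reduce(key, values, context), without the Hadoop types):

```java
import java.util.*;

// Sketch of the reduce step: for each key, the reducer receives the
// list of values grouped by the shuffle and simply sums them.
public class ReduceStep {
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v; // occurrence count for this word
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(reduce("C", List.of(1, 1, 1))); // 3
    }
}
```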
Thus the output of the job is:
< A, 2>
< B, 2>
< C, 3>
< R, 2>