BASIC UNDERSTANDING OF MAPREDUCE


MapReduce is a framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job basically splits the input data into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the outputs
of the maps, which are then input to the reduce tasks. Typically both the input and the
output of the job are stored in a file-system. The framework takes care of scheduling tasks,
monitoring them, and re-executing failed tasks.

Inputs and Outputs:
The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of
<key, value> pairs as the output of the job, conceivably of different types.
There are five stages in MapReduce:
- INPUT
- SPLIT
- MAPPING
- SORTING AND SHUFFLING
- REDUCE PHASE
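The five stages above can be sketched as a minimal, self-contained simulation in plain Python (not the Hadoop API), run on the example input used in the walkthrough below:

```python
# A plain-Python simulation of the five MapReduce stages.
from itertools import groupby
from operator import itemgetter

# INPUT: one line of text.
line = "A B R C C R A C B"

# SPLIT: the input is divided into independent chunks.
splits = ["A B R", "C C R", "A C B"]

# MAPPING: each map task emits a <word, 1> pair per token.
mapped = [(word, 1) for chunk in splits for word in chunk.split()]

# SORTING AND SHUFFLING: pairs are sorted by key and grouped,
# so each reducer sees one key together with all of its values.
shuffled = [(key, [v for _, v in group])
            for key, group in groupby(sorted(mapped), key=itemgetter(0))]

# REDUCE PHASE: sum the values for each key.
reduced = {key: sum(values) for key, values in shuffled}

print(reduced)  # {'A': 2, 'B': 2, 'C': 3, 'R': 2}
```

In a real cluster the map tasks run on different nodes and the shuffle moves data across the network; here every stage is an in-memory list, which is only meant to show the data flow.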


Let us try to understand the two tasks, Map and Reduce, with the help of a small
example.

The WordCount application is quite straightforward.
The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <word, 1>.
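The map step just described can be sketched in plain Python (a stand-in for the Java map method, with `str.split()` playing the role of StringTokenizer):

```python
# Plain-Python sketch of WordCount's map step (not the Hadoop Mapper class).
def word_count_map(line):
    """Emit a <word, 1> pair for every whitespace-separated token in one line."""
    return [(word, 1) for word in line.split()]

pairs = word_count_map("A B R")
print(pairs)  # [('A', 1), ('B', 1), ('R', 1)]
```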
For the above example, the input is the single line (INPUT):
< A B R C C R A C B >


This line is divided into three independent chunks (SPLIT):
< A B R >
< C C R >
< A C B >
The number of maps spawned for a given job, and how to control them in a fine-grained manner, is a topic of its own. WordCount also specifies a combiner. Hence, the output of each map is passed through the local combiner for local aggregation, after being sorted on the keys.
The output of the map tasks (MAPPING):
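The combiner mentioned above can be sketched in plain Python (a stand-in for the Hadoop Combiner, which in WordCount is the same logic as the reducer): it aggregates one map task's output locally, after sorting by key, so less data has to travel across the network to the reducers.

```python
# Plain-Python sketch of a combiner: local, per-map-task aggregation.
from itertools import groupby
from operator import itemgetter

def combine(map_output):
    """Sum counts per key within a single map task's output."""
    out = []
    for key, group in groupby(sorted(map_output), key=itemgetter(0)):
        out.append((key, sum(v for _, v in group)))
    return out

# Output of one map over the split "C C R":
print(combine([("C", 1), ("C", 1), ("R", 1)]))  # [('C', 2), ('R', 1)]
```

Note that the small walkthrough below shows the un-combined pairs for clarity; with the combiner enabled, each map would already have merged its duplicate keys.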
< A, 1>
< B, 1>
< R, 1> 
< C, 1>
< C, 1>
< R, 1> 
< A, 1>
< C, 1>
< B, 1> 
The same pairs after being sorted and grouped by key (SORTING AND SHUFFLING):
< A, 1>
< A, 1>
< B, 1>
< B, 1>
< C, 1>
< C, 1>
< C, 1> 
< R, 1>
< R, 1>
The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key (i.e. words in this example).
Thus the output of the job is:
< A, 2>
< B, 2>
< C, 3>
< R, 2> 
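The reduce step can be sketched in plain Python (a stand-in for the Java reduce method): after the shuffle, each reducer call receives one key with all of its values and simply sums them.

```python
# Plain-Python sketch of WordCount's reduce step (not the Hadoop Reducer class).
def word_count_reduce(key, values):
    """Sum the occurrence counts collected for one key."""
    return (key, sum(values))

# After shuffling, the reducer for key 'C' receives all three of its counts:
print(word_count_reduce("C", [1, 1, 1]))  # ('C', 3)
```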
