BASIC UNDERSTANDING OF MAPREDUCE & PROCEDURE
MapReduce is a framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed ones.
Inputs and Outputs:
The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
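The point that input and output pairs can have different types can be sketched in plain Java (this is not the Hadoop API; the class and method names here are illustrative only):

```java
import java.util.*;

// Plain-Java sketch of the <key, value> flow: the job's input pairs and
// the intermediate pairs emitted by the map may use different types.
public class PairTypes {
    // Input pair: <byte offset in the file (Long), line of text (String)>.
    // Emitted pair: <word (String), count (Integer)>.
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            out.add(Map.entry(token, 1)); // emit <word, 1> (offset unused in this sketch)
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "A B R")); // [A=1, B=1, R=1]
    }
}
```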
There are 5 stages in MapReduce:
- INPUT
- SPLIT
- MAPPING
- SORTING AND SHUFFLING
- REDUCE PHASE
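The five stages can be traced end-to-end in plain Java on this post's example input. This is only a single-process sketch of what the framework distributes across a cluster; the split size of 3 tokens is an assumption made to match the example:

```java
import java.util.*;

// End-to-end sketch of the five MapReduce stages (single process, no Hadoop).
public class FiveStages {
    static Map<String, Integer> run(String input) {
        // SPLIT: divide the input into splits of 3 tokens each (assumed size).
        String[] tokens = input.trim().split("\\s+");
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i += 3)
            splits.add(Arrays.asList(tokens).subList(i, Math.min(i + 3, tokens.length)));
        // MAPPING: each split's map task emits a <word, 1> pair per token.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (List<String> split : splits)
            for (String word : split)
                mapped.add(Map.entry(word, 1));
        // SORTING AND SHUFFLING: group the 1s by word, sorted on the key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped)
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        // REDUCE PHASE: sum each word's list of 1s.
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, ones) ->
            result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run("A B R C C R A C B")); // {A=2, B=2, C=3, R=2}
    }
}
```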
Let us try to understand the two tasks, Map & Reduce, with the help of a small example.
The WordCount application is quite straightforward. The Mapper implementation, via its map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via StringTokenizer, and emits a key-value pair of < <word>, 1> for each token.
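The map step described above can be shown in isolation as plain Java. This is a sketch of what the Hadoop Mapper's map() does, without the Hadoop types (Text, IntWritable, Context), using the same StringTokenizer the text names:

```java
import java.util.*;

// The map step in isolation: tokenize one line of input with
// StringTokenizer and emit a <word, 1> pair for every token.
public class MapStep {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line); // splits on whitespace
        while (itr.hasMoreTokens()) {
            emitted.add(Map.entry(itr.nextToken(), 1)); // emit <word, 1>
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(map("A B R")); // [A=1, B=1, R=1]
    }
}
```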
For the above example, the input is the single line (INPUT):
A B R C C R A C B
The framework divides it into three splits, one per map task (SPLIT):
A B R
C C R
A C B
We'll learn more later about the number of maps spawned for a given job, and how to control them in a fine-grained manner. WordCount also specifies a combiner. Hence, the output of each map is passed through the local combiner for local aggregation, after being sorted on the keys.
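The combiner's job can be sketched in plain Java: take the raw pairs one map task emitted, sort them on the key, and pre-sum equal keys on the map side before anything is sent over the network to the reducers (again a sketch, not the Hadoop Combiner API):

```java
import java.util.*;

// Sketch of the local combiner: one map task's raw output is sorted on
// the key, and equal keys are summed locally before the shuffle.
public class CombineStep {
    static SortedMap<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, Integer> combined = new TreeMap<>(); // sorted on the keys
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum); // local aggregation
        }
        return combined;
    }

    public static void main(String[] args) {
        // One map's raw output for the split "C C R":
        List<Map.Entry<String, Integer>> raw =
            List.of(Map.entry("C", 1), Map.entry("C", 1), Map.entry("R", 1));
        System.out.println(combine(raw)); // {C=2, R=1}
    }
}
```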
The output of the first map (split A B R), after sorting and local aggregation (MAPPING & SHUFFLE):
< A, 1>
< B, 1>
< R, 1>
The output of the second map (split C C R):
< C, 2>
< R, 1>
The output of the third map (split A C B):
< A, 1>
< B, 1>
< C, 1>
The Reducer implementation, via its reduce method, just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
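The summing described here can be sketched as a plain-Java reduce function (a stand-in for the Hadoop Reducer's reduce(key, values, context), without the Hadoop types):

```java
import java.util.*;

// Sketch of the reduce step: for each key, the reducer receives the
// list of values grouped by the shuffle and simply sums them.
public class ReduceStep {
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v; // occurrence count for this word
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(reduce("C", List.of(1, 1, 1))); // 3
    }
}
```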
Thus the output of the job is:
< A, 2>
< B, 2>
< C, 3>
< R, 2>