Incredible Power of Hadoop MapReduce

Hadoop is a solution for Big Data problems and projects. It offers many powerful technical weapons for tackling hard Big Data issues, and MapReduce is one of the most powerful. MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. Let's discuss the "Incredible Power of Hadoop MapReduce" in detail, with code.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing; in Hadoop it is implemented in Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map task.
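To make the two tasks concrete, here is a minimal single-machine sketch in plain Java, with no Hadoop involved. The class name MapReduceSketch and the word-count task are assumptions chosen for illustration; the sketch only mirrors the idea of map (emit key/value tuples) followed by reduce (combine the tuples that share a key).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine illustration of the map and reduce ideas; this is not Hadoop code.
class MapReduceSketch {

    // Map: break one input line into (word, 1) tuples.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: combine all values that share a key into one smaller result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map over every line, groups the tuples by key, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=1}
        System.out.println(run(Arrays.asList("big data", "big cluster")));
    }
}
```

In a real cluster the grouping step in run() is what the framework performs between the map and reduce tasks, across many machines.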

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm
  • Generally, the MapReduce paradigm is based on sending the computation to where the data resides!
  • A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
    • Map stage: The mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
    • Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
  • During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
  • The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
  • Most of the computing takes place on nodes that hold the data on their local disks, which reduces network traffic.
  • After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
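The shuffle stage is the least visible of the three, so here is a toy sketch of it in plain Java. It assumes a hash-based choice of reducer, similar in spirit to Hadoop's default hash partitioning; the class and method names (ShuffleSketch, partition, shuffle) are illustrative only and are not part of Hadoop.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the shuffle stage; the real framework does this across machines.
class ShuffleSketch {

    // Chooses a reducer for a key: non-negative hash modulo the reducer count.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Routes every (key, value) pair to one reducer and groups values under sorted keys.
    static List<Map<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<Map<String, List<Integer>>> perReducer = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            perReducer.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            perReducer.get(partition(pair.getKey(), numReducers))
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return perReducer;
    }

    public static void main(String[] args) {
        // Pretend these pairs came out of several map tasks.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("1981", 34));
        mapOutput.add(new SimpleEntry<>("1984", 40));
        mapOutput.add(new SimpleEntry<>("1981", 36));

        List<Map<String, List<Integer>>> parts = shuffle(mapOutput, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("reducer " + i + " receives " + parts.get(i));
        }
    }
}
```

The point to notice is that every value for a given key ends up at the same reducer, which is what lets the reduce stage work on one key at a time.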

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes must be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
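As a rough illustration of what "serializable and sortable" means for a key class, the sketch below defines a small key type in plain Java. WritableLike is a hypothetical stand-in declared locally with the same two method signatures as Hadoop's Writable (write and readFields), so the example compiles without the Hadoop jar; YearKey and roundTrip are also illustrative names. In a real job the class would implement org.apache.hadoop.io.WritableComparable instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableSketch {

    // Hypothetical stand-in declaring the same two methods as Hadoop's Writable,
    // defined here only so the sketch compiles without the Hadoop jar.
    interface WritableLike {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Illustrative key type: serializable via WritableLike, sortable via Comparable.
    static class YearKey implements WritableLike, Comparable<YearKey> {
        private int year;

        YearKey() { }                           // empty instance, filled later by readFields
        YearKey(int year) { this.year = year; }

        int getYear() { return year; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(year);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            year = in.readInt();
        }

        @Override
        public int compareTo(YearKey other) {
            return Integer.compare(year, other.year);
        }
    }

    // Serializes a key to bytes and reads it back into a fresh instance.
    static YearKey roundTrip(YearKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearKey copy = new YearKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        YearKey copy = roundTrip(new YearKey(1984));
        System.out.println(copy.getYear());                         // prints 1984
        System.out.println(new YearKey(1979).compareTo(copy) < 0);  // prints true
    }
}
```

The no-argument constructor matters: a framework that deserializes keys typically creates an empty object first and then fills it from the byte stream.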

 

          Input               Output
Map       <k1, v1>            list(<k2, v2>)
Reduce    <k2, list(v2)>      list(<k3, v3>)

Terminology

The following terminology is very important from an interview perspective. Please go through it briefly; in the upcoming sessions we will discuss these terms in detail as required.

 

  • PayLoad– The application's implementations of the Map and Reduce functions, which form the core of the job.
  • Mapper– Maps the input key/value pairs to a set of intermediate key/value pairs.
  • NameNode– Node that manages the Hadoop Distributed File System (HDFS).
  • DataNode– Node where the data is present in advance, before any processing takes place.
  • MasterNode– Node where the JobTracker runs and which accepts job requests from clients.
  • SlaveNode– Node where the Map and Reduce programs run.
  • JobTracker– Schedules jobs and tracks the jobs assigned to the TaskTrackers.
  • TaskTracker– Tracks the task and reports status to the JobTracker.
  • Job– An execution of a Mapper and a Reducer across a dataset.
  • Task– An execution of a Mapper or a Reducer on a slice of data.
  • Task Attempt– A particular instance of an attempt to execute a task on a SlaveNode.

Example Scenario

 

Given below is the data regarding the electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years.

Year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Avg
1979   23   23    2   43   24   25   26   26   26   26   25   26   25
1980   26   27   28   28   28   30   31   31   31   30   30   30   29
1981   31   32   32   32   33   34   35   36   36   34   34   34   34
1984   39   38   39   39   39   41   42   43   40   39   38   38   40
1985   38   39   39   39   39   41   41   41   00   40   39   39   45

If the above data is given as input, we have to write applications to process it and produce results such as the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers when the number of records is small: they simply write the logic to produce the required output and pass the data to the application.
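For a handful of records, that logic can be as simple as the plain Java sketch below. It assumes the annual averages from the Avg column have already been read into a map; the class and method names (UsageExtremes, extremeYear, sampleAverages) are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Non-distributed logic for a small data set; fine until the data outgrows one machine.
class UsageExtremes {

    // Returns the year with the highest annual average, or the lowest when findMax is false.
    // On ties, the first year encountered wins.
    static int extremeYear(Map<Integer, Integer> avgByYear, boolean findMax) {
        Integer bestYear = null;
        int bestAvg = 0;
        for (Map.Entry<Integer, Integer> e : avgByYear.entrySet()) {
            int avg = e.getValue();
            boolean better = bestYear == null || (findMax ? avg > bestAvg : avg < bestAvg);
            if (better) {
                bestYear = e.getKey();
                bestAvg = avg;
            }
        }
        if (bestYear == null) {
            throw new IllegalArgumentException("no records");
        }
        return bestYear;
    }

    // Annual averages copied from the Avg column of the sample data.
    static Map<Integer, Integer> sampleAverages() {
        Map<Integer, Integer> avgByYear = new LinkedHashMap<>();
        avgByYear.put(1979, 25);
        avgByYear.put(1980, 29);
        avgByYear.put(1981, 34);
        avgByYear.put(1984, 40);
        avgByYear.put(1985, 45);
        return avgByYear;
    }

    public static void main(String[] args) {
        System.out.println("max usage year: " + extremeYear(sampleAverages(), true));   // 1985
        System.out.println("min usage year: " + extremeYear(sampleAverages(), false));  // 1979
    }
}
```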

But think of the data representing the electrical consumption of all the large-scale industries of a particular state since its formation.

When we write applications to process such bulk data:

  • They will take a lot of time to execute.
  • There will be heavy network traffic when we move data from the source to the network server, and so on.

To solve these problems, we have the MapReduce framework.

Input Data

The above data is saved as sample.txt and given as input. The input file looks as shown below.

1979   23   23   2   43   24   25   26   26   26   26   25   26  25

1980   26   27   28  28   28   30   31   31   31   30   30   30  29

1981   31   32   32  32   33   34   35   36   36   34   34   34  34

1984   39   38   39  39   39   41   42   43   40   39   38   38  40

1985   38   39   39  39   39   41   41   41   00   40   39   39  45

Example Program

Given below is a program that processes the sample data using the MapReduce framework. For each year, it emits the annual average if that average is greater than 30 units.

package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits
{
    // Mapper class: emits (year, annual average) for each input line
    public static class E_EMapper extends MapReduceBase implements
        Mapper<LongWritable,    /* Input key type */
               Text,            /* Input value type */
               Text,            /* Output key type */
               IntWritable>     /* Output value type */
    {
        // Map function
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException
        {
            String line = value.toString();
            String lasttoken = null;
            // Default delimiters split on spaces as well as tabs
            StringTokenizer s = new StringTokenizer(line);
            if (!s.hasMoreTokens())
            {
                return; // skip blank lines
            }
            String year = s.nextToken();

            // Walk to the last column, which holds the annual average
            while (s.hasMoreTokens())
            {
                lasttoken = s.nextToken();
            }
            if (lasttoken == null)
            {
                return; // line has a year but no values
            }

            int avgUnits = Integer.parseInt(lasttoken);
            output.collect(new Text(year), new IntWritable(avgUnits));
        }
    }

    // Reducer class: keeps only the years whose average exceeds the threshold
    public static class E_EReduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable>
    {
        // Reduce function
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException
        {
            int maxavg = 30; // threshold: only averages above this are emitted
            int val = Integer.MIN_VALUE;

            while (values.hasNext())
            {
                if ((val = values.next().get()) > maxavg)
                {
                    output.collect(key, new IntWritable(val));
                }
            }
        }
    }

    // Main function: configures and submits the job
    public static void main(String args[]) throws Exception
    {
        JobConf conf = new JobConf(ProcessUnits.class);

        conf.setJobName("max_eletricityunits");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(E_EMapper.class);
        // The reducer is a pure filter, so it can safely double as the combiner
        conf.setCombinerClass(E_EReduce.class);
        conf.setReducerClass(E_EReduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
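Before submitting the job to a cluster, the parsing and filtering logic can be tried out locally. The sketch below is a plain Java approximation with no Hadoop classes: it assumes whitespace-separated fields and reuses the threshold of 30 from the reducer. The names ProcessUnitsLocalCheck, parse and run are hypothetical helpers, not part of the program above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Local, Hadoop-free approximation of the mapper's parsing and the reducer's filter.
class ProcessUnitsLocalCheck {

    static final int THRESHOLD = 30; // same cut-off as maxavg in the reducer

    // Mapper-style parsing: first token is the year, last token is the annual average.
    // Returns null for lines with fewer than two tokens.
    static String[] parse(String line) {
        StringTokenizer s = new StringTokenizer(line); // splits on any whitespace
        if (!s.hasMoreTokens()) {
            return null;
        }
        String year = s.nextToken();
        String last = null;
        while (s.hasMoreTokens()) {
            last = s.nextToken();
        }
        return last == null ? null : new String[] {year, last};
    }

    // Reducer-style filter: keep only records whose average exceeds the threshold.
    static List<String> run(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] record = parse(line);
            if (record != null && Integer.parseInt(record[1]) > THRESHOLD) {
                out.add(record[0] + "\t" + Integer.parseInt(record[1]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "1979 23 23 2 43 24 25 26 26 26 26 25 26 25",
                "1980 26 27 28 28 28 30 31 31 31 30 30 30 29",
                "1981 31 32 32 32 33 34 35 36 36 34 34 34 34",
                "1984 39 38 39 39 39 41 42 43 40 39 38 38 40",
                "1985 38 39 39 39 39 41 41 41 00 40 39 39 45");
        for (String row : run(sample)) {
            System.out.println(row); // 1981, 1984 and 1985 pass the filter
        }
    }
}
```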

Save the above program as ProcessUnits.java. The compilation and execution of the program is explained below.

Compilation and Execution of Process Units Program

Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1

The following command creates a directory to store the compiled Java classes.

$ mkdir units

Step 2

Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 to download the jar. Let us assume the jar is downloaded to /home/hadoop/.

Step 3

The following commands are used for compiling the ProcessUnits.java program and creating a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java

$ jar -cvf units.jar -C units/ .

Step 4

The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5

The following command is used to copy the input file named sample.txt into the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir

Step 6

The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7

The following command is used to run the ProcessUnits application, taking the input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Wait for a while until the job finishes. After execution, as shown below, the output will contain the number of input splits, the number of map tasks, the number of reduce tasks, etc.

INFO mapreduce.Job: Job job_1414748220717_0002 completed successfully
14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49
   File System Counters
      FILE: Number of bytes read=61
      FILE: Number of bytes written=279400
      FILE: Number of read operations=0
      FILE: Number of large read operations=0
      FILE: Number of write operations=0
      HDFS: Number of bytes read=546
      HDFS: Number of bytes written=40
      HDFS: Number of read operations=9
      HDFS: Number of large read operations=0
      HDFS: Number of write operations=2
   Job Counters
      Launched map tasks=2
      Launched reduce tasks=1
      Data-local map tasks=2
      Total time spent by all maps in occupied slots (ms)=146137
      Total time spent by all reduces in occupied slots (ms)=441
      Total time spent by all map tasks (ms)=14613
      Total time spent by all reduce tasks (ms)=44120
      Total vcore-seconds taken by all map tasks=146137
      Total vcore-seconds taken by all reduce tasks=44120
      Total megabyte-seconds taken by all map tasks=149644288
      Total megabyte-seconds taken by all reduce tasks=45178880
   Map-Reduce Framework
      Map input records=5
      Map output records=5
      Map output bytes=45
      Map output materialized bytes=67
      Input split bytes=208
      Combine input records=5
      Combine output records=5
      Reduce input groups=5
      Reduce shuffle bytes=6
      Reduce input records=5
      Reduce output records=5
      Spilled Records=10
      Shuffled Maps =2
      Failed Shuffles=0
      Merged Map outputs=2
      GC time elapsed (ms)=948
      CPU time spent (ms)=5160
      Physical memory (bytes) snapshot=47749120
      Virtual memory (bytes) snapshot=2899349504
      Total committed heap usage (bytes)=277684224
   File Output Format Counters
      Bytes Written=40

Step 8

The following command is used to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step 9

The following command is used to see the output in the part-00000 file. This file is generated by the MapReduce job and stored in HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000

Below is the output generated by the MapReduce program.

1981    34

1984    40

1985    45

Step 10

The following command is used to copy the output folder from HDFS to the local file system for analysis.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

Important Commands

All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script without any arguments prints the description for all commands.

Usage :

hadoop [--config confdir] COMMAND

The following table lists the options available and their description.

Options                      Description
namenode -format             Formats the DFS filesystem.
secondarynamenode            Runs the DFS secondary namenode.
namenode                     Runs the DFS namenode.
datanode                     Runs a DFS datanode.
dfsadmin                     Runs a DFS admin client.
mradmin                      Runs a Map-Reduce admin client.
fsck                         Runs a DFS filesystem checking utility.
fs                           Runs a generic filesystem user client.
balancer                     Runs a cluster balancing utility.
oiv                          Applies the offline fsimage viewer to an fsimage.
fetchdt                      Fetches a delegation token from the NameNode.
jobtracker                   Runs the MapReduce job Tracker node.
pipes                        Runs a Pipes job.
tasktracker                  Runs a MapReduce task Tracker node.
historyserver                Runs job history servers as a standalone daemon.
job                          Manipulates the MapReduce jobs.
queue                        Gets information regarding JobQueues.
version                      Prints the version.
jar <jar>                    Runs a jar file.
distcp <srcurl> <desturl>    Copies file or directories recursively.
distcp2 <srcurl> <desturl>   DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest>
                             Creates a hadoop archive.
classpath                    Prints the class path needed to get the Hadoop jar and the required libraries.
daemonlog                    Get/Set the log level for each daemon

How MapReduce Organizes Work?

Hadoop divides the job into tasks. As described above, there are two types of tasks:

  1. Map tasks (Splits & Mapping)
  2. Reduce tasks (Shuffling, Reducing)

The complete execution process (of both Map and Reduce tasks) is controlled by two types of entities:

  1. JobTracker: Acts like a master (responsible for the complete execution of a submitted job)
  2. Multiple TaskTrackers: Act like slaves, each of them performing part of the job

For every job submitted for execution in the system, there is one JobTracker, which runs on the master node (often the same machine as the NameNode), and there are multiple TaskTrackers, which reside on the DataNodes.

  • A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
  • It is the responsibility of the JobTracker to coordinate the activity by scheduling tasks to run on different data nodes.
  • Execution of an individual task is then looked after by the TaskTracker, which resides on every data node executing part of the job.
  • The TaskTracker's responsibility is to send progress reports to the JobTracker.
  • In addition, the TaskTracker periodically sends a 'heartbeat' signal to the JobTracker to notify it of the current state of the system.
  • Thus the JobTracker keeps track of the overall progress of each job. In the event of task failure, the JobTracker can reschedule it on a different TaskTracker.
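The heartbeat and rescheduling behaviour described above can be pictured with a toy model in plain Java. This is only a sketch under simple assumptions (a fixed timeout, round-robin reassignment, times passed in explicitly); it is not how Hadoop's JobTracker is implemented, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of heartbeat tracking and rescheduling; not Hadoop's actual JobTracker code.
class HeartbeatSketch {
    private final long timeoutMs;
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();  // tracker -> last heartbeat time
    private final Map<String, String> assignment = new TreeMap<>();   // task -> tracker

    HeartbeatSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called whenever a task tracker reports in.
    void heartbeat(String tracker, long nowMs) {
        lastHeartbeat.put(tracker, nowMs);
    }

    void assign(String task, String tracker) {
        assignment.put(task, tracker);
    }

    String trackerOf(String task) {
        return assignment.get(task);
    }

    // Moves tasks away from trackers whose last heartbeat is older than the timeout.
    // Returns the names of the tasks that were rescheduled.
    List<String> rescheduleFailed(long nowMs) {
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() <= timeoutMs) {
                live.add(e.getKey());
            }
        }
        List<String> moved = new ArrayList<>();
        if (live.isEmpty()) {
            return moved; // nowhere to move the tasks
        }
        int next = 0;
        for (Map.Entry<String, String> e : assignment.entrySet()) {
            if (!live.contains(e.getValue())) {
                e.setValue(live.get(next++ % live.size())); // round-robin over live trackers
                moved.add(e.getKey());
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        HeartbeatSketch tracker = new HeartbeatSketch(500);
        tracker.heartbeat("A", 0);
        tracker.heartbeat("B", 0);
        tracker.assign("task1", "A");
        tracker.assign("task2", "B");
        tracker.heartbeat("B", 900);                          // A stays silent
        System.out.println(tracker.rescheduleFailed(1000));   // prints [task1]
        System.out.println(tracker.trackerOf("task1"));       // prints B
    }
}
```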

I will introduce Hive, another simple and powerful weapon of Hadoop, to you in the next session.

Please read this once again after some time so as to have a clear path. Good luck, friends!
