Hadoop is an open source framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is an Apache project sponsored by the Apache Software Foundation. Data volumes have grown exponentially, from gigabytes to petabytes, and are still growing. Handling data at that scale is difficult for traditional systems such as Oracle and Teradata. Another major factor in Hadoop's rise is that maintaining and querying such large data sets on older systems is extremely costly and slow. Hadoop has emerged as a solution to these Big Data problems. So let's kick-start our series “Hadoop solution for Bigdata – system installation setup”.
Please go through it carefully and then try the setup yourself; doing it hands-on will give you confidence.
This series aims to give our readers real-world, industry-standard skills in one of the most in-demand technologies on the market, which can also help if you are looking to switch jobs.
Why is Hadoop so fast (in one sentence)?
Older systems analyse and process data record by record, whereas Hadoop splits the data across multiple nodes, processes it on all DataNodes in parallel, and finally collects the output from each node.
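The split, process-in-parallel, and collect idea can be tried on a single machine with nothing but the shell. This is only a toy sketch: the "nodes" are background shell jobs, the data set is made up, and the file names are arbitrary. Real Hadoop distributes the chunks across machines.

```shell
# Toy illustration only: "nodes" here are background shell jobs on one machine.
work=$(mktemp -d)

# Sample data set: 1000 numbered records.
seq 1 1000 > "$work/records.txt"

# Break the data into four 250-line chunks, one per pretend node.
split -l 250 "$work/records.txt" "$work/chunk_"

# Process every chunk in parallel; each job counts the records ending in 7.
for chunk in "$work"/chunk_*; do
    grep -c '7$' "$chunk" > "$chunk.out" &
done
wait    # block until all parallel jobs have finished

# Collect the partial results and combine them into one answer.
total=$(cat "$work"/chunk_*.out | awk '{ sum += $1 } END { print sum }')
echo "records ending in 7: $total"

rm -rf "$work"
```

Each chunk is handled independently, so adding more chunks (and more workers) spreads the same work more widely, which is the property Hadoop exploits.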
- Ability to store and process huge amounts of any kind of data quickly.
- Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
- Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
- Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
What is Big Data?
Big Data is simply the idea that the amount of data we generate (and, more importantly, collect) is increasing extremely quickly. Companies are recognizing that this data can be used to make more accurate predictions and, therefore, make them more money. In layman's terms, this exponentially increasing data can be tagged as Big Data, and Hadoop is a solution for handling it.
The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework that solves big data problems. You can consider it a suite that encompasses a number of services (ingesting, storing, analyzing and maintaining). Let us discuss and get a brief idea of how the services work individually and in collaboration.
Let's have a brief look at the Hadoop terminology and components.
- The Hadoop Distributed File System (HDFS) is the core component, or you could say the backbone, of the Hadoop Ecosystem.
- HDFS makes it possible to store different types of large data sets (i.e. structured, unstructured and semi-structured data).
- HDFS creates a level of abstraction over the resources, so that we can see the whole of HDFS as a single unit.
- It helps us store our data across various nodes and maintains a log file about the stored data (metadata).
- HDFS has two core components, i.e. NameNode and DataNode.
- The NameNode is the main node and it doesn’t store the actual data. It contains metadata, like a log file or a table of contents. Therefore, it requires less storage and high computational resources.
- On the other hand, all your data is stored on the DataNodes, which therefore require more storage resources. These DataNodes are commodity hardware (like your laptops and desktops) in the distributed environment. That is why Hadoop solutions are very cost effective.
- You always communicate with the NameNode when writing data. The NameNode then tells the client which DataNodes to store and replicate the data on.
YARN performs all your processing activities by allocating resources and scheduling tasks.
It has two major components, i.e. ResourceManager and NodeManager.
- The ResourceManager is again a main node, this time in the processing department.
- It receives the processing requests and passes parts of each request to the corresponding NodeManagers, where the actual processing takes place.
- NodeManagers are installed on every DataNode. Each one is responsible for executing tasks on its DataNode.
- The ResourceManager itself has two components, i.e. Scheduler and ApplicationsManager.
- Scheduler: based on your application's resource requirements, the Scheduler runs scheduling algorithms and allocates the resources.
- ApplicationsManager: the ApplicationsManager accepts job submissions, negotiates the container (i.e. the DataNode environment where a process executes) for running the application-specific ApplicationMaster, and monitors progress. ApplicationMasters are the daemons that reside on DataNodes and communicate with containers to execute tasks on each DataNode.
MapReduce is the core component of processing in the Hadoop Ecosystem, as it provides the processing logic. In other words, MapReduce is a software framework that helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
In a MapReduce program, Map() and Reduce() are two functions.
- The Map function performs actions like filtering, grouping and sorting.
- The Reduce function aggregates and summarizes the results produced by the Map function.
- The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
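To make the key-value flow concrete, here is a rough local stand-in for a word count, written as a plain shell pipeline. It is not Hadoop code: the sample text is invented, and each stage only imitates the phase named in the comments.

```shell
# Local stand-in for MapReduce word count; no Hadoop involved.
input='big data needs big tools
hadoop handles big data'

# Stage 1 (tr + awk)  = Map:     emit one "word<TAB>1" pair per word.
# Stage 2 (sort)      = Shuffle: bring identical keys next to each other.
# Stage 3 (awk)       = Reduce:  sum the values of each run of equal keys.
counts=$(printf '%s\n' "$input" \
    | tr -s ' ' '\n' \
    | awk '{ print $1 "\t" 1 }' \
    | sort \
    | awk -F '\t' '
        $1 != key { if (NR > 1) print key "\t" sum; key = $1; sum = 0 }
        { sum += $2 }
        END { if (NR > 0) print key "\t" sum }')

printf '%s\n' "$counts"
```

The reducer relies on the sort step: it only ever compares a key with the one before it, which is why the shuffle has to group equal keys first.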
A DataNode stores data in the Hadoop File System. A functional file system has more than one DataNode, with the data replicated across them.
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It doesn’t store the data of these files itself.
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack. (In Hadoop 2.x, YARN's ResourceManager and NodeManagers take over the JobTracker and TaskTracker roles.)
A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
The Secondary NameNode's whole purpose is to keep a checkpoint in HDFS. It is just a helper node for the NameNode.
System requirements for Hadoop Installation
To install Hadoop in pseudo-distributed mode, a cluster of computers is simulated on your single machine.
RAM – at least 8GB
CPU – quad-/hex-/octo-core CPUs, running at least 2-2.5 GHz.
Running Hadoop on Linux is preferable, as there is far more documentation on the web for installing Hadoop on Linux than on Windows.
Apart from that, I would say it is better to buy a cloud-based Hadoop service than to install it on your own system (but let's first install it locally; we will discuss cloud installations in later stages).
Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes:
Local/Standalone Mode : After downloading Hadoop onto your system, it is configured in standalone mode by default and can be run as a single Java process.
Pseudo Distributed Mode : It is a distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN, MapReduce and so on, will run as a separate Java process. This mode is useful for development.
Fully Distributed Mode : This mode is fully distributed, with a minimum of two machines as a cluster. We will go over this mode in detail in the coming sections.
How to Install Hadoop in Stand-Alone Mode
To follow this tutorial, you will need:
- An Ubuntu 16.04 server with a non-root user with sudo privileges: You can learn more about how to set up a user with these privileges in the Initial Server Setup with Ubuntu 16.04 guide.
Once you’ve completed this prerequisite, you’re ready to install Hadoop and its dependencies.
Hadoop solution for Bigdata – system installation setup
Before installing Hadoop in the Linux environment, we need to set up Linux using ssh (Secure Shell). Follow the steps below to set up the Linux environment.
Creating a User
At the beginning, it is recommended to create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system. Follow the steps below to create a user:
- Open the root account using the command “su”.
- Create a user from the root account using the command “useradd username”.
- Now you can open an existing user account using the command “su username”.
Open the Linux terminal and type the commands listed above to create the user.
Step 1 — Installing Java
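To get started, we’ll update our package list with sudo apt-get update.
Next, we’ll install OpenJDK, the default Java Development Kit on Ubuntu 16.04, with sudo apt-get install default-jdk.
Once the installation is complete, let’s check the version with java -version. The output should look similar to this: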
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
This output verifies that OpenJDK has been successfully installed.
SSH setup is required to perform different operations on a cluster, such as starting, stopping, and distributed daemon shell operations. To authenticate different users of Hadoop, you need to provide a public/private key pair for a Hadoop user and share it with the other users. The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the cluster. To work seamlessly, SSH needs to be set up to allow password-less login for the Hadoop user from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair and share it across the cluster.
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created earlier.
The following commands generate a key pair using SSH: first su - hduser to switch to the Hadoop user, then ssh-keygen -t rsa -P "".
The second command creates an RSA key pair with an empty password.
-P "" here indicates an empty password.
You then have to enable SSH access to your local machine with this newly created key, which is done with the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys.
The final step is to test the SSH setup by connecting to the local machine as the hduser user (ssh localhost). This step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file.
If the SSH connection fails, we can try the following (optional):
- Enable debugging with ssh -vvv localhost and investigate the error in detail.
- Check the SSH server configuration in /etc/ssh/sshd_config. If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
Step 2 — Installing Hadoop
With Java in place, we’ll visit the Apache Hadoop Releases page to find the most recent stable release. Follow the binary link for the current release.
Here we will discuss the installation of Hadoop 2.7.3 in standalone mode.
There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
On the server, we’ll use wget to fetch it, passing the download link copied from that page.
Note: The Apache website will direct you to the best mirror dynamically, so the exact URL you copy may vary.
In order to make sure that the file we downloaded hasn’t been altered, we’ll do a quick check using SHA-256. Return to the releases page, then follow the Apache link.
Open the directory for the version you downloaded. Then locate the .mds file for the release you downloaded and copy the link for the corresponding file. Again, we’ll right-click to copy the file location, then use wget to transfer the file.
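Then run the verification by computing the SHA-256 checksum of the downloaded archive (for example with shasum -a 256 hadoop-2.7.3.tar.gz) and displaying the published value for comparison:
cat hadoop-2.7.3.tar.gz.mds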
You can safely ignore the difference in case and the spaces. The output of the command we ran against the file we downloaded from the mirror should match the value in the file we downloaded from apache.org.
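Comparing two long checksums by eye is error-prone, so the comparison can be scripted. The sketch below uses a scratch file in place of the real download, and the helper names sha256 and normalize are made up for this example. It shows one way to ignore case and spacing when comparing.

```shell
# Sketch only: a scratch file stands in for the downloaded Hadoop archive.
work=$(mktemp -d)
printf 'pretend this is the release archive\n' > "$work/release.tar.gz"

# Use sha256sum where available (Linux), otherwise shasum -a 256.
sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{ print $1 }'
    else
        shasum -a 256 "$1" | awk '{ print $1 }'
    fi
}

# Strip whitespace and fold to lowercase so formatting differences are ignored.
normalize() {
    printf '%s' "$1" | tr -d ' \n\t' | tr '[:upper:]' '[:lower:]'
}

actual=$(sha256 "$work/release.tar.gz")

# Simulate a published value that is upper-case and grouped with spaces.
published=$(printf '%s' "$actual" | tr '[:lower:]' '[:upper:]' | sed 's/......../& /g')

if [ "$(normalize "$actual")" = "$(normalize "$published")" ]; then
    verdict="checksum OK"
else
    verdict="checksum MISMATCH"
fi
echo "$verdict"
rm -rf "$work"
```

For a real check, the published value would be pasted in from the .mds file rather than derived from the download itself.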
Now that we’ve verified that the file wasn’t corrupted or changed, we’ll use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output, and -f to specify that we’re extracting from a file. Use tab-completion or substitute the correct version number in the command: tar -xzvf hadoop-2.7.3.tar.gz
Finally, we’ll move the extracted files into /usr/local, the appropriate place for locally installed software, with sudo mv hadoop-2.7.3 /usr/local/hadoop. Change the version number, if needed, to match the version you downloaded.
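If you would like to try the extract-and-move steps before touching the real download, the sketch below does so on a tiny stand-in archive. Everything here is an assumption made for the demo: the name hadoop-x.y.z, the scratch directories, and the placeholder file. Nothing is written outside a temporary directory.

```shell
# Sketch only: builds a tiny stand-in archive so the flags can be tried safely.
work=$(mktemp -d)
mkdir -p "$work/src/hadoop-x.y.z/bin"
echo 'placeholder' > "$work/src/hadoop-x.y.z/bin/hadoop"
tar -czf "$work/hadoop-x.y.z.tar.gz" -C "$work/src" hadoop-x.y.z

# -x extract, -z gunzip, -v list each file, -f read from the named archive.
mkdir "$work/unpacked"
tar -xzvf "$work/hadoop-x.y.z.tar.gz" -C "$work/unpacked"

# Move the extracted tree to its install location (a scratch "usr/local" here;
# on a real server this would be /usr/local/hadoop and would need sudo).
mkdir -p "$work/usr/local"
mv "$work/unpacked/hadoop-x.y.z" "$work/usr/local/hadoop"

installed=$(cat "$work/usr/local/hadoop/bin/hadoop")
echo "installed file contains: $installed"
rm -rf "$work"
```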
With the software in place, we’re ready to configure its environment.
Step 3 — Configuring Hadoop’s Java Home
Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file.
The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. We will use readlink with the -f flag to follow every symlink in every part of the path, recursively. Then, we’ll use sed to trim bin/java from the output to give us the correct value for JAVA_HOME.
To find the default Java path, run readlink -f /usr/bin/java | sed "s:bin/java::"
You can copy this output to set Hadoop’s Java home to this specific version, which ensures that even if the default Java changes, this value will not. Alternatively, you can use the readlink command dynamically in the file so that Hadoop will automatically use whatever Java version is set as the system default.
To begin, open /usr/local/hadoop/etc/hadoop/hadoop-env.sh in a text editor.
Then, choose one of the following options:
Option 1: Set a Static Value
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
Option 2: Use Readlink to Set the Value Dynamically
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Note: With respect to Hadoop, the value of JAVA_HOME in hadoop-env.sh overrides any values that are set in the environment by /etc/profile or in a user’s profile.
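To see what the sed step in this section actually does, it helps to run it on a known path. In the sketch below, java_home_from is a hypothetical helper name, and the sample path simply mirrors the static value from Option 1. The last part only runs if /usr/bin/java exists on your machine.

```shell
# Strip the "bin/java" suffix from a resolved binary path, as the sed step does.
# The ":" characters are sed delimiters, chosen so the "/" in the path needs no escaping.
java_home_from() {
    printf '%s\n' "$1" | sed "s:bin/java::"
}

# Sample resolved path, mirroring the static value shown in Option 1.
sample=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
derived=$(java_home_from "$sample")
echo "$derived"

# On a machine with Java installed, resolve the real path dynamically.
if [ -e /usr/bin/java ]; then
    JAVA_HOME=$(java_home_from "$(readlink -f /usr/bin/java)")
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

The static option pins one Java version; the dynamic form follows whatever the system default points to at the time the file is read.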
Step 4 — Running Hadoop
Now we should be able to run Hadoop by calling /usr/local/hadoop/bin/hadoop with no arguments, which prints its usage help:
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
CLASSNAME run the class named CLASSNAME
Next, we can run hadoop-mapreduce-examples, a Java archive bundling several example MapReduce programs. We’ll invoke its grep program, one of many examples included in hadoop-mapreduce-examples, followed by the input directory, input, and the output directory, grep_example. This assumes ~/input already exists and contains some text files to search, and that ~/grep_example does not exist yet. The MapReduce grep program counts the matches of a literal word or regular expression. Finally, we’ll supply a regular expression to find occurrences of the word principal within or at the end of a declarative sentence. The expression is case-sensitive, so we wouldn’t find the word if it were capitalized at the beginning of a sentence:
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep ~/input ~/grep_example 'principal[.]*'
When the task completes, it provides a summary of what has been processed and any errors it has encountered, but this summary doesn’t contain the actual results. Those are written to files in the output directory, ~/grep_example.
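To get a feel for what the regular expression matches before running the full job, the same pattern can be tried with the ordinary grep utility on a few sample lines. This is a local approximation, not the MapReduce program itself, and the sample sentences are invented for the demo.

```shell
# Local stand-in for the MapReduce grep example; no Hadoop required.
work=$(mktemp -d)
cat > "$work/sample.txt" <<'EOF'
The principal reason is cost.
Principal components are ignored here.
Ask the principal. Then thank the principal.
EOF

# -o prints every match on its own line; uniq -c counts identical matches.
grep -o 'principal[.]*' "$work/sample.txt" | sort | uniq -c | sort -rn

# Total number of matches; "Principal" with a capital P is not counted.
total=$(grep -o 'principal[.]*' "$work/sample.txt" | wc -l | tr -d ' ')
echo "total matches: $total"
rm -rf "$work"
```

The pattern matches the word followed by zero or more literal dots, so "principal" and "principal." are reported as separate strings.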
Let's see and cross-check some more alternatives and requirements.
Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the command /usr/local/hadoop/bin/hadoop version.
If everything is fine with your setup, you should see the Hadoop version and build details printed.
This means your Hadoop standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.
Installing Hadoop in Pseudo Distributed Mode
The steps below install Hadoop 2.4.1 in pseudo-distributed mode.
Stage 1: Setting Up Hadoop
You can set the Hadoop environment variables by appending export commands to the ~/.bashrc file.
Now apply all the changes to the currently running system with source ~/.bashrc.
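As a rough guide, the lines added to ~/.bashrc might look like the sketch below. The install location /usr/local/hadoop is an assumption carried over from the standalone steps; change it if you unpacked Hadoop elsewhere.

```shell
# Assumed install location; adjust if Hadoop was unpacked elsewhere.
export HADOOP_HOME=/usr/local/hadoop

# Directory holding core-site.xml, hdfs-site.xml and the other config files.
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

# Put the hadoop/hdfs commands (bin) and the start/stop scripts (sbin) on the PATH.
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

echo "HADOOP_HOME=$HADOOP_HOME"
```

With bin and sbin on the PATH, later commands can be typed as hadoop or start-dfs.sh without the full path.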
Stage 2: Hadoop Configuration
You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. You need to make changes in those configuration files according to your Hadoop infrastructure.
In order to develop Hadoop programs in Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
The core-site.xml file contains information such as the port number used for the Hadoop instance, memory allocated for the file system, memory limit for storing the data, and the size of Read/Write buffers.
Open core-site.xml and add the required properties between the <configuration> and </configuration> tags.
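A minimal core-site.xml for a single machine could be written as in the sketch below. It writes into a scratch directory rather than the real configuration directory, so you can inspect the result first. The host and port in hdfs://localhost:9000 are assumed values for a local setup, not something this guide prescribes.

```shell
# Scratch stand-in for $HADOOP_HOME/etc/hadoop; copy the file over once reviewed.
conf_dir=$(mktemp -d)

cat > "$conf_dir/core-site.xml" <<'EOF'
<configuration>
  <!-- Default file system URI; localhost:9000 is an assumed host and port. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat "$conf_dir/core-site.xml"
```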
The hdfs-site.xml file contains information such as the replication factor, the NameNode path, and the DataNode paths on your local file systems. In other words, it defines where you want to store the Hadoop infrastructure.
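For a single-node setup, hdfs-site.xml might contain something like the sketch below. The two directory paths under /home/hduser are hypothetical placeholders, and a replication factor of 1 is assumed because there is only one DataNode. The file is written to a scratch directory for review.

```shell
# Scratch stand-in for $HADOOP_HOME/etc/hadoop; copy the file over once reviewed.
conf_dir=$(mktemp -d)

cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<configuration>
  <!-- One copy of each block is enough on a single machine. -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Hypothetical local paths for NameNode metadata and DataNode blocks. -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hduser/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hduser/hdfs/datanode</value>
  </property>
</configuration>
EOF

cat "$conf_dir/hdfs-site.xml"
```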
The yarn-site.xml file is used to configure YARN for Hadoop. Open yarn-site.xml and add the required properties between the <configuration> and </configuration> tags in this file.
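A small yarn-site.xml that enables the shuffle service MapReduce jobs depend on could look like the sketch below. As before, it is written to a scratch directory, and it shows only the one property most single-node setups need.

```shell
# Scratch stand-in for $HADOOP_HOME/etc/hadoop; copy the file over once reviewed.
conf_dir=$(mktemp -d)

cat > "$conf_dir/yarn-site.xml" <<'EOF'
<configuration>
  <!-- Auxiliary service that lets NodeManagers serve map output to reducers. -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

cat "$conf_dir/yarn-site.xml"
```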
The mapred-site.xml file is used to specify which MapReduce framework we are using. By default, Hadoop ships only a template for it, named mapred-site.xml.template. First of all, copy the template to mapred-site.xml with the command cp mapred-site.xml.template mapred-site.xml.
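Once the copy exists, the setting that tells MapReduce to run on YARN might be added as in the sketch below. For the demo the file is created directly in a scratch directory instead of being copied from the template.

```shell
# Scratch stand-in for $HADOOP_HOME/etc/hadoop; copy the file over once reviewed.
conf_dir=$(mktemp -d)

cat > "$conf_dir/mapred-site.xml" <<'EOF'
<configuration>
  <!-- Run MapReduce jobs on YARN rather than the local runner. -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

cat "$conf_dir/mapred-site.xml"
```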
Formatting the HDFS filesystem via the NameNode
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command hdfs namenode -format.
Starting your single-node cluster
Before starting the cluster, make sure the Hadoop user has the required permissions on the Hadoop directory.
Then run the start scripts (start-dfs.sh and start-yarn.sh on Hadoop 2.x).
This starts a NameNode and a DataNode along with the processing daemons: a JobTracker and a TaskTracker on Hadoop 1.x, or a ResourceManager and a NodeManager under YARN on Hadoop 2.x.
Initially you have to format the configured HDFS file system: open the NameNode (HDFS server) and execute hdfs namenode -format, as described above.
After formatting HDFS, start the distributed file system. The start-dfs.sh command starts the NameNode as well as the DataNodes as a cluster.
Listing Files in HDFS
After loading data onto the server, we can list the files in a directory, or check the status of a file, using ‘ls’. The syntax is $HADOOP_HOME/bin/hadoop fs -ls <args>, where the argument can be a directory or a filename.
Inserting Data into HDFS
Assume we have data in a file called file.txt on the local system that needs to be saved in the HDFS file system. Follow the steps given below to insert the file into the Hadoop file system.
You have to create an input directory, using hadoop fs -mkdir.
Transfer and store the data file from the local system to the Hadoop file system using the put command (hadoop fs -put).
You can verify the file using the ls command (hadoop fs -ls).
Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration of retrieving the required file from the Hadoop file system.
Initially, view the data in HDFS using the cat command (hadoop fs -cat).
Then get the file from HDFS to the local file system using the get command (hadoop fs -get).
Shutting Down the HDFS
You can shut down HDFS by using the stop-dfs.sh command.
There are many more commands in “$HADOOP_HOME/bin/hadoop fs” than are demonstrated here, although these basic operations will get you started. Running ./bin/hadoop dfs with no additional arguments will list all the commands that can be run with the FsShell system. Furthermore, $HADOOP_HOME/bin/hadoop fs -help commandName will display a short usage summary for the operation in question. Let's proceed to the next topic soon… So this is the beginning of our “Hadoop solution for Bigdata” series.
We will discuss some industry-standard MapReduce code next… keep an eye on this blog.