
Real-time importance of Hadoop Distributions

Hadoop is free, open-source software, and anyone can use the Apache version of it directly. But there are three main distributions, industry-standard wrappers over core Hadoop that ease implementation and improve security. An organization chooses whichever of these top three distributions best suits its needs. We are going to have a detailed discussion of the "Real-time importance of Hadoop Distributions", with sample code. In the previous session we learned about an industry-standard Hadoop installation; this session is a continuation of the Hadoop series.

With the demand for big data technologies expanding rapidly, Apache Hadoop sits at the heart of the big data revolution. The open-source Hadoop framework is still somewhat immature, and big data analytics companies are now eyeing Hadoop vendors: a growing community that delivers robust capabilities, tools, and innovations for improved commercial Hadoop big data solutions. Here are the top three big data analytics vendors serving the Hadoop needs of various big data companies by providing commercial support.

Marcus Collins, a Research Analyst at Gartner, said, "Big data analytics and the Apache Hadoop open source project are rapidly emerging as the preferred Big Data solutions to address business and technology trends that are disrupting traditional data management and processing."


Which Hadoop distribution is right for you?

Hadoop is an open-source framework for processing Big Data. These days there are many Hadoop distributions to choose from, one of them being the Apache Hadoop distribution itself, maintained by the Apache Software Foundation. This distribution is free and has a very large community behind it.
If an enterprise wants to deploy a Hadoop distribution alongside its existing enterprise applications, this can be difficult because Hadoop is written in Java and optimized for Linux-based operating systems. This can lead to an impedance mismatch between Hadoop and the existing applications, so integrating the Hadoop ecosystem with the enterprise's existing components is not straightforward.

To solve this issue, a few companies came up with distribution models for Hadoop. There are three primary flavors of Hadoop distribution. They are as follows:

  • Companies that provide paid support and training for the Apache Hadoop distribution. (Cloudera, Hortonworks, MapR, IBM, etc.)
  • Companies that provide a set of supporting tools for deployment and management of Apache Hadoop as an alternative flavor. (Cloudera, Hortonworks, MapR)
  • Companies that enhance Apache Hadoop by adding vendor-specific features and code. These are paid enhancements, many of which address particular use cases. (Cloudera, Hortonworks, MapR, IBM, and other companies that take on Hadoop projects)

The parent of all these distributions is open-source Apache Hadoop. Companies developing these distributions stay in close touch with Apache Hadoop and follow its trends.

The advantages of using them are:

  • These distributions generally test all features in depth and in a timely manner.
  • They provide support, which saves administration and management costs for an organization.

The disadvantage of using a distribution other than Apache Hadoop is vendor lock-in.

The tools and vendor-specific features provided by one vendor might not be available in another distribution, or they may not be compatible with other third-party tools, which brings in the cost of migration. That cost is not limited to technology shifts alone; it also involves training, capacity planning, and re-architecting costs for the organization.

Now the big question is, “Which Hadoop distribution is right for your organization?”

Selecting a Hadoop distribution can seem pretty difficult, but it all comes down to which distribution best suits your Big Data needs.

Assuming Hadoop solves your problem, how do you decide which Hadoop distribution to use? They all look similar: each contains more than a dozen open-source software components, all run on commodity hardware, and all can run much the same analytical workloads. Yet there are a few differences in what you get for your money.

When evaluating Hadoop distributions, there are different criteria to consider. They are as follows:

How does it perform?

The Apache Hadoop distribution is written in Java and runs inside the Java Virtual Machine (JVM). Though this increases application portability, it comes with some overhead: code is compiled to bytecode at runtime, and the JVM must perform garbage collection. Comparatively, it is not as fast as an application compiled directly for the target hardware.

To address this, some vendors optimize their distributions for particular hardware, increasing job performance per node. Features such as compression and decompression can also be optimized for certain hardware types.
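As a concrete illustration of this kind of tuning, here is a minimal Java sketch, not tied to any particular vendor, that enables Snappy compression for intermediate map output in a MapReduce job driver. It uses the standard Hadoop 2.x property names; the class and job names are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressionDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output to cut shuffle I/O.
        conf.setBoolean("mapreduce.map.output.compress", true);

        // Snappy favors speed over compression ratio; vendors often ship
        // hardware-optimized native codecs behind this same interface.
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression-demo");
        // ... configure mapper, reducer, and input/output paths as usual ...
    }
}

Because the codec sits behind a common interface, swapping in a vendor's optimized codec is essentially a one-line configuration change.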

Is it scalable?

Distributions should be scalable, i.e., a distribution should be able to expand its resources in both the computational and storage dimensions.

When it comes to scaling out the cluster, the effort should ideally be limited to adding more disks to existing nodes or adding new nodes to the cluster network. In practice, distributions differ in the effort and cost required to scale a Hadoop cluster. Scaling out comes with administration and deployment costs, and those costs depend on the existing architecture and how well it complements and complies with the Hadoop distribution being evaluated.

Is it reliable?

Any distributed system is subject to partial failures. Failures can be caused by hardware, software, or network issues, and the mean time between failures is smaller when running on commodity hardware.

The major weakness of Hadoop 1.x is that the NameNode, which locates and keeps track of every block of every file in the cluster, is a single point of failure (SPOF). In other words, if the NameNode fails, the data on all the other nodes becomes unreachable, because it cannot be found without the NameNode.

To overcome this issue, Hadoop 2.x introduced NameNode High Availability. In this approach, Hadoop runs two NameNodes: one active and one standby. If the active NameNode fails, the standby takes over almost instantly, removing the single point of failure (SPOF).
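As a rough sketch of what this looks like in practice, the following hdfs-site.xml excerpt enables an HA pair. The nameservice ID mycluster, the NameNode IDs nn1/nn2, and the hostnames are all placeholder values; a real deployment also needs shared edits storage and fencing configured:

<!-- hdfs-site.xml (excerpt): one logical nameservice backed by two NameNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>master1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>master2.example.com:8020</value>
</property>
<!-- Let the ZooKeeper-based failover controllers promote the standby automatically -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>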

How manageable is it?

Deploying and managing the Apache Hadoop open-source distribution requires internal understanding of the source code and configuration. This is not a widely available skill in IT administration. In addition, administrators in enterprises are caretakers of a wide range of systems, Hadoop being one of them.

Distributions offer integration with development and debugging tools. Developers and data scientists in an enterprise will already be using a set of tools, and the more overlap there is between the organization's toolset and the distribution's, the better. The advantage of overlap comes not only in licensing costs but also in a reduced need for training and orientation, and it can increase productivity, since people are already accustomed to certain tools.

Now, let’s look at available Hadoop distributions in the market.

There are a number of Hadoop distributions. A comprehensive list can be found at the link below:

http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support

We will be examining five of them that are widely used:

  • Apache Hadoop Distribution
  • Cloudera Distribution of Hadoop (CDH)
  • Hortonworks Data Platform (HDP)
  • MapR
  • Pivotal HD

Apache Hadoop Distribution

The Apache Hadoop distribution is developed by the Apache Software Foundation. It is available free of cost and has a very large community behind it. This distribution serves as the base for all other distributions, and the contributions of this community shape it.

Deployment and management of the Apache Hadoop distribution within an enterprise requires internal understanding of the source code and configuration.

Cloudera Distribution of Hadoop (CDH)

Cloudera was formed in March 2009 with the primary objective of providing Apache Hadoop software, support, services, and training for enterprise-class deployment of Hadoop and its ecosystem components.

Cloudera brands its distribution as the Cloudera Distribution of Hadoop (CDH). Cloudera is one of the major sponsors of the Apache Software Foundation and pushes almost all upstream enhancements into its distribution. They also provide services for Hadoop deployment.

CDH is currently in its fifth major version (CDH 5) and is considered a mature Hadoop distribution. The paid version of CDH comes with proprietary management software, Cloudera Manager.

Hortonworks Data Platform (HDP)

Hortonworks was formed in June 2011, with objectives similar to Cloudera's. Its distribution is branded as the Hortonworks Data Platform (HDP). The HDP suite's Hadoop and other software are completely free, with paid support and training. Hortonworks also pushes enhancements upstream, back to Apache Hadoop.

HDP is currently in its second major version and is considered the rising star among Hadoop distributions. It comes with a free and open-source management tool called Ambari.

MapR

MapR was founded in 2009 with a mission to build an enterprise-grade Hadoop. The distribution it provides has significant proprietary code compared to Apache Hadoop, and compatibility with existing Apache Hadoop projects is guaranteed only for a handful of components. The key piece of proprietary code in the MapR distribution is the replacement of HDFS with a POSIX-compatible file system exposed over NFS. Another key feature is the ability to take snapshots.

MapR comes with its own management console. The different grades of the product are named M3, M5, and M7. M5 is the standard commercial distribution from the company, M3 is a free version without high availability, and M7 is a paid version with a rewritten HBase API.

Which is better? It depends entirely on what you're seeking. Because Hadoop is licensed under the Apache License, a free software license, these vendors contribute patches and updates back to the core Hadoop distribution, which everyone benefits from. So it's best to turn your attention to each vendor's strengths and weaknesses, based on the product offered and the add-ons developed for your use case.

Here are a few things that make each of the top three vendors stand out from each other:

  • For tutorials, I slightly prefer Hortonworks because of how they're presented online. That said, when I tried to go through the tutorials using the Hortonworks Sandbox (based on HDP 2.0), I had issues running some of the examples without them failing. Hopefully this isn't a widespread problem.
  • From a training perspective, Cloudera seems to have the most complete and professional training program of the three. But with that comes a bigger price tag: Cloudera's training program and exams are typically the costliest.
  • One thing that makes Hortonworks stand out is that it supports the Microsoft Windows operating system, whereas the other vendors support only Linux. (Microsoft has also taken the Hortonworks product and packaged it into its own service called HDInsight, which can be used for on-premises Hadoop installations or run in the Windows Azure cloud service.)
  • While Cloudera and Hortonworks both use the NameNode and DataNode architecture to separate where metadata is stored from where data processing is done, and both depend on HDFS, MapR takes a more distributed approach, storing metadata on the processing nodes themselves, and depends on a different distributed file system architecture.
  • Hadoop 2 was released recently, and if immediate upgrade offerings matter to you, Hortonworks was the first to release a complete production-ready Hadoop distribution based on version 2. Cloudera did ship Hadoop 2 features in an earlier release, but some of the components weren't considered production-ready.

These are three companies that have been very strong in the past year and have received quite a bit of venture-capital funding. In this regard, MapR is even more interesting: late last year, news broke that it was planning to go public, which means it could raise even more money for its products and development. This is an exciting time as big data gears up to really take off, and if MapR's IPO plan is any indicator, the next year is going to be very interesting to watch.

HOW TO ANALYZE SERVER LOGS WITH CASCADING AND APACHE HADOOP ON HORTONWORKS DATA PLATFORM

Historically, programming languages and software frameworks have evolved in a singular direction, with a singular purpose: to achieve simplicity, hide complexity, improve developer productivity, and make coding easier. In the process, they have fostered innovation to the degree we see, and benefit from, today.

Is anyone among you "young" enough to admit to writing code in microcode and assembly language?

Yours truly wrote his first lines of "Hello World" in assembly language on the VAX and PDP-11, the same computer hardware (and software) that facilitated the genesis of C and UNIX at Bell Labs. Indisputably, we have come a long way from microcode to assembly and from C to Java, which has facilitated high-level abstraction frameworks and enabled innovative technologies such as J2EE web services frameworks and the Apache Hadoop MapReduce computing framework, to mention a few.

Add to that long list the Cascading Java SDK for writing data-driven applications for Apache Hadoop running on the Hortonworks Data Platform (HDP). Even better, Cascading 3.0 will support Apache Tez, the high-performance parallel execution engine.

DATA FLOW AND DATA PIPELINE

In the previous blog, I explored the genesis of the Java Cascading framework. I argued that at the core of any data-driven application there exists a pattern of data transformation that mimics aspects of Extract, Transform, and Load (ETL) operations.

I showed how the Cascading framework embraces those common ETL operations by providing high-level logical building blocks as Java composite classes, allowing a developer to write data-driven apps without resorting to (or even knowing about) the MapReduce Java API or the underlying complexity of the Apache Hadoop infrastructure.
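To make those building blocks concrete before the full example, here is a minimal, self-contained sketch (my own illustration, not from the earlier blog) of a Cascading flow that simply copies text lines from a source tap to a sink tap:

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class CopyFlow {

    public static void main(String[] args) {
        // Source and sink taps on HDFS; the paths come from the command line.
        Tap inTap = new Hfs(new TextLine(), args[0]);
        Tap outTap = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

        // A bare pipe with no operations: every input tuple passes straight through.
        Pipe copy = new Pipe("copy");

        // Wire source -> pipe -> sink and run the flow on Hadoop.
        FlowDef flowDef = FlowDef.flowDef()
                .addSource(copy, inTap)
                .addTailSink(copy, outTap)
                .setName("copy-flow");
        new HadoopFlowConnector().connect(flowDef).complete();
    }
}

Everything in the larger listing below follows this same tap-pipe-flow shape; the only difference is the operations attached to the pipes.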

PARSING LOGS WITH MAPREDUCE

In this blog we examine a common usage: reading, parsing, transforming, sorting, storing, and extracting data value from a large server log. The value extracted is the list of the top-ten inbound IP addresses. For this example, we've curtailed one server log to 160 MB. In reality, these could be weeks of server logs, with gigabytes of data.

[Figure: the data flow of the log-parsing job: read and parse each line, group by IP, count, sort, and write the top-ten list]

Keeping the above flow in mind, we can write a very simple Java MapReduce program, without writing to the Java MapReduce API and without knowledge of the underlying Apache Hadoop complexity. For example, below is a complete source listing of the above transformation, in just a few dozen lines of code: that's simple!

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.filter.Limit;
import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

import java.util.Properties;

public class Main {

    public static void main(String[] args) {
        // Input and output paths, taken from the command line
        String inputPath = args[0];
        String outputPath = args[1];

        // Source and sink taps on HDFS; the sink writes tab-delimited text with a header
        Tap inTap = new Hfs(new TextLine(), inputPath);
        Tap outTap = new Hfs(new TextDelimited(true, "\t"), outputPath, SinkMode.REPLACE);

        // Parse each line of input and break it into five fields
        RegexParser parser = new RegexParser(
                new Fields("ip", "time", "request", "response", "size"),
                "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) ([^ ]*).*$",
                new int[]{1, 2, 3, 4, 5});

        // Create a pipe that processes one line at a time
        Pipe processPipe = new Each("processPipe", new Fields("line"), parser, Fields.RESULTS);

        // Group the stream within the pipe by the field "ip"
        processPipe = new GroupBy(processPipe, new Fields("ip"));

        // Aggregate each "ip" group using the Cascading built-in Count function
        processPipe = new Every(processPipe, Fields.GROUP, new Count(new Fields("IPcount")), Fields.ALL);

        // After counting each "ip", sort the counts in descending order
        Pipe sortedCountByIpPipe = new GroupBy(processPipe, new Fields("IPcount"), true);

        // Limit the output to the first ten entries
        sortedCountByIpPipe = new Each(sortedCountByIpPipe, new Fields("IPcount"), new Limit(10));

        // Join the pipes together in a flow, wiring up the input and output taps
        FlowDef flowDef = FlowDef.flowDef()
                .addSource(processPipe, inTap)
                .addTailSink(sortedCountByIpPipe, outTap)
                .setName("DataProcessing");

        Properties properties = AppProps.appProps()
                .setName("DataProcessing")
                .buildProperties();

        Flow parsedLogFlow = new HadoopFlowConnector(properties).connect(flowDef);

        // Finally, execute the flow
        parsedLogFlow.complete();
    }
}

CODE WALK

First, we create taps (sources and sinks) using the Hfs() constructor, and specify how each line should be split into five fields using RegexParser(). Second, we create a series of Pipes to process each line: GroupBy() aggregates the IPs, a second GroupBy() sorts the counts, and Limit() keeps the top ten IP addresses. Third, we connect the pipes, input taps, and output taps into a FlowDef and turn it into a Flow with HadoopFlowConnector(). And finally, we execute the flow with flow.complete().

So in four easy programming steps, we created a MapReduce program without a single reference to the MapReduce API. Instead, we used only the high-level logical constructs and classes of the Cascading SDK.

Compiling the code into a jar and running it on the HDP 2.1 Sandbox produced the output shown below. Note the number of MapReduce jobs and the final submission to YARN.
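For reference, submitting the flow looks something like the following (the jar name and HDFS paths here are illustrative placeholders, not from the original run):

hadoop jar logparser.jar Main /user/hue/logs/access.log /user/hue/output/top-ips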

[Screenshot: console output from the run, showing the MapReduce jobs and the final submission to YARN]

Note: please go through this session at least two or three times, as most of the interview questions asked to test your real basics are covered here. The next few sessions are going to be very serious and to the point, to raise your Hadoop skills to an industrial level.

 

Please leave a comment, as it really matters. Your comments are our energy boosters.
