Hadoop, a solution for big data, has several individual components which together are called the Hadoop ecosystem. Let's take an in-depth look at the components of Hadoop and their importance. This is must-have information for cracking any technical interview. So let's see “HADOOP ECOSYSTEM COMPONENTS AND ITS ARCHITECTURE”.
The components of the Hadoop ecosystem are distinct, explicit entities. The holistic view of the Hadoop architecture gives prominence to four of them: Hadoop Common, Hadoop YARN, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce. Hadoop Common provides the Java libraries, utilities, OS-level abstractions, and scripts needed to run Hadoop, while Hadoop YARN is a framework for job scheduling and cluster resource management. HDFS provides high-throughput access to application data, and Hadoop MapReduce provides YARN-based parallel processing of large data sets.
In our earlier articles, we defined “What is Apache Hadoop”. To recap, Apache Hadoop is an open source distributed computing framework for storing and processing huge unstructured datasets spread across the servers of a cluster. The basic working principle behind Apache Hadoop is to break up the data and distribute it in many parts so that it can be analyzed concurrently. Big data applications using Apache Hadoop continue to run even if an individual server in the cluster fails, owing to the robust and stable nature of Hadoop.
What is Hadoop Ecosystem?
Hadoop is mainly a framework, and the Hadoop ecosystem includes a set of official Apache open source projects and a number of commercial tools and solutions. Spark, Hive, Oozie, Pig, and Sqoop are a few of the popular open source tools, while the commercial tools are mainly provided by the vendors Cloudera, Hortonworks, and MapR. Apart from these, a large number of Hadoop production, maintenance, and development tools are also available from various vendors. Each of these tools or solutions supports one or two of the core elements of Apache Hadoop, which are HDFS, YARN, MapReduce, and Common.
Architecture of Apache Hadoop
Apache Hadoop is used to process a huge amount of data. The architecture of Apache Hadoop consists of various technologies and Hadoop components through which even complex data problems can be solved easily. The following image represents the architecture of Hadoop Ecosystem:
Hadoop architecture is based on a master-slave design. When the data is large, Hadoop stores the data files across multiple servers and then maps the processing onto those servers, which reduces the work left for later operations. Each server works as a node with its own computing power, so the nodes are not dumb storage like disk drives. Data processing in the Hadoop system follows this master-slave architecture, which looks like the following figure:
The components shown in this image are described below:
- NameNode: It controls operations on the data.
- DataNode: DataNodes write the data to local storage. Storing all data in a single place is not always recommended, as it may cause loss of data in case of an outage.
- TaskTracker: TaskTrackers accept the tasks assigned to the slave node.
- Map: It takes data from a stream and processes each line after splitting it into fields.
- Reduce: Here the fields obtained through Map are grouped together or concatenated with each other.
Components of Hadoop Ecosystem
The key components of the Hadoop ecosystem include the following:
HDFS (Hadoop Distributed File System):
This is the core component of the Hadoop ecosystem, and it can store a huge amount of structured, unstructured, and semi-structured data. It creates an abstraction layer over the entire data set, and a log file describing the data on the various nodes can also be maintained and stored through this file system. The NameNode and the DataNode are the two key components of HDFS.
The NameNode stores metadata rather than the actual data, so it requires less storage and fewer computational resources. All data is stored on the DataNodes, which therefore require more storage. DataNodes run on commodity hardware, comparable to ordinary laptops or desktops, which makes the Hadoop solution cost-effective.
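The split between NameNode metadata and DataNode storage can be illustrated with a small Python sketch. This is a toy model, not HDFS code: the function name `plan_blocks` and the node names are hypothetical, the block size and replication factor are set to the common defaults of 128 MB and 3, and the round-robin placement is a simplification of the rack-aware policy that HDFS actually uses.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def plan_blocks(file_size, data_nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return NameNode-style metadata: block index -> DataNodes holding a replica."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    block_map = {}
    for block_id in range(num_blocks):
        # Simplified round-robin placement (hypothetical policy, not rack-aware).
        block_map[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)] for r in range(replication)
        ]
    return block_map


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    for block_id, replicas in plan_blocks(300 * 1024 * 1024, nodes).items():
        print(block_id, replicas)
```

The dictionary returned here plays the role of the metadata kept by the NameNode: it is small compared with the file itself, while the blocks it points to would live on the DataNodes.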
YARN, or Yet Another Resource Negotiator, is like the brain of the Hadoop ecosystem: it coordinates all processing, including resource allocation, job scheduling, and activity processing. The two major components of YARN are the Resource Manager and the Node Manager. The Resource Manager passes the parts of each request to the appropriate Node Manager. Every DataNode has a Node Manager, which is responsible for task execution. It has the following architecture:
YARN enables dynamic resource utilization, so the user can run various Hadoop applications on the YARN framework without increasing workloads. It offers high scalability, agility, new and unique programming models, and improved utilization of the clusters.
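The hand-off from the Resource Manager to the Node Managers can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the class names mirror the YARN roles but are not the YARN API, memory is the only resource tracked, and the "most free memory first" rule is a hypothetical policy rather than one of YARN's real schedulers.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy scheduler: place each request on the node with the most free memory."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app, memory_mb):
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < memory_mb:
            return None  # no capacity; a real scheduler would queue the request
        node.free_mb -= memory_mb
        node.containers.append((app, memory_mb))
        return node.name


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
    print(rm.allocate("app1", 3072))
    print(rm.allocate("app1", 2048))
    print(rm.allocate("app2", 2048))
```

Each successful call corresponds to a container being started on one Node Manager, which is where the task actually executes.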
MapReduce is a combination of two operations, named Map and Reduce. It is the core processing component of Hadoop and helps to process large data sets using parallel and distributed algorithms inside the Hadoop environment. Map and Reduce are basically two functions, which are defined as follows:
The Map function performs filtering, sorting, and grouping operations, while the Reduce function summarizes and aggregates the results produced by the Map function. Both functions work on key-value pairs: Map emits a value for each key, and Reduce combines all values that share a key. The MapReduce framework of Hadoop runs on the YARN architecture, which supports parallel processing of large data sets.
The basic concept behind MapReduce is that “Map” sends the processing to the various DataNodes that hold the data, and “Reduce” collects the results of this processing and outputs a single combined result for each key.
Here the JobTracker and TaskTracker are the two daemons that handle job tracking in classic MapReduce processing. In YARN-based Hadoop, these roles are taken over by the Resource Manager, the per-application Application Master, and the Node Managers.
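The Map and Reduce functions described above can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the programming model, not Hadoop's Java API; the function names are chosen for this sketch, and the `shuffle` step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

On a real cluster, `map_phase` would run in parallel on the nodes holding each input split, and each reducer would receive only the keys assigned to it.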
Apache Pig provides a procedural language for parallel processing of large data sets in the Hadoop environment, and it is an alternative to writing Java programs. Pig includes two components, Pig Latin and the Pig runtime, much like Java and the JVM. Pig Latin has SQL-like commands, so non-programmers can also use it, as it involves very little coding. At the back end, Pig Latin executes as MapReduce jobs: the compiler converts a Pig Latin script into a sequence of MapReduce jobs, which hides the MapReduce details behind an abstraction.
Pig is usually used for complex use cases that require multiple data operations, and it is a processing language rather than a query language. Applications for sorting and aggregation can be developed with Pig. The platform is customizable, so users can write their own functions. User-defined functions can be written in popular programming languages, including Ruby, Python, and Java, which makes Pig a good option for those who like to work in these languages.
HBase is an open source, non-relational (NoSQL) database. Because it stores values as uninterpreted bytes, it can handle any data type inside a Hadoop system. It runs on HDFS and is modeled on Google's BigTable, which is also a distributed storage system that supports large data sets. HBase itself is written in Java, and applications can access it through its REST, Thrift, and Avro APIs. HBase is designed for problems where a small amount of data has to be found within a huge data set.
This NoSQL database was not designed to handle transactional or relational workloads. Instead, it is designed for data that does not fit the relational model. It does not support SQL queries natively; however, SQL queries can be run against HBase data through another Apache tool such as Hive. HBase is designed to store structured data in tables that may have billions of rows and columns. Large messaging applications, such as Facebook's, have been built using this technology. JDBC and ODBC access is also available through such SQL layers.
Mahout, Spark MLlib:
Mahout is used for machine learning and provides an environment for developing machine learning applications. With it, we can build self-learning systems that work without being explicitly programmed. Such systems learn from past experience, user behavior, and data patterns, and can make decisions on that basis, much like artificial intelligence. Mahout can perform collaborative filtering, clustering, and classification. The operations that Mahout supports are discussed below:
- Collaborative filtering: On the basis of user behavior, the system learns users' behavioral patterns and characteristics. A typical example is an e-commerce website.
- Clustering: It builds clusters of similar data items, such as blogs, research papers, and news.
- Classification: Data is classified into categories, such as news, blogs, essays, research papers, and others.
- Item Set Creation and Recommendation: Mahout can check which items are usually bought together and build item sets from them. For example, a phone cover is often purchased together with a mobile phone, so Mahout will recommend a phone cover or another accessory to customers who buy a phone.
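The item-set idea in the last bullet can be made concrete with a small co-occurrence count in Python. This is not Mahout's API, only a minimal sketch of the underlying technique; the function names and the sample baskets are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(item, baskets, top_n=2):
    """Return up to top_n items most often bought together with `item`."""
    counts = co_occurrence(baskets)
    related = Counter({b: n for (a, b), n in counts.items() if a == item})
    return [other for other, _ in related.most_common(top_n)]


if __name__ == "__main__":
    # Hypothetical purchase history, one list per order.
    orders = [
        ["phone", "cover"],
        ["phone", "cover", "charger"],
        ["phone", "charger"],
        ["phone", "cover"],
        ["laptop", "mouse"],
    ]
    print(recommend("phone", orders))
```

A production recommender would also normalize for item popularity and run the counting step in parallel across the cluster.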
To manage the clusters, one can use ZooKeeper, which is also known as the king of coordination. It provides reliable, fast, and organized operational services for Hadoop clusters. ZooKeeper provides a distributed configuration service, a synchronization service, and a naming registry for the distributed environment. Before ZooKeeper, the complete process of task coordination was difficult and time-consuming. Synchronization was also problematic during configuration, and changing the configuration was difficult.
ZooKeeper provides a speedy and manageable environment and saves a lot of time by performing grouping, maintenance, naming, and synchronization operations quickly. It offers a powerful solution for Hadoop use cases. Many big brands, like eBay, Yahoo, and Rackspace, use ZooKeeper for many of their use cases. ZooKeeper has therefore become an important Hadoop tool.
Job scheduling is an important and unavoidable process in a Hadoop system. Apache Oozie performs the job scheduling and works like an alarm and clock service inside the Hadoop ecosystem. Oozie can schedule Hadoop jobs and bind them together so that they work as one logical unit. The two main kinds of jobs that Oozie runs are:
- Oozie Workflow: A sequential set of instructions and actions is executed. If there is more than one job to be executed, each job starts only after the previous one has completed.
- Oozie Coordinator: The Oozie jobs are triggered when the data arrives for processing. These jobs wait and do not execute while the data has not arrived; once it arrives, they are executed to take the proper action.
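The difference between the two job types can be sketched in a few lines of Python. Real Oozie jobs are defined in XML and run on the cluster; the functions below are hypothetical stand-ins that only model the control flow, and the status strings are illustrative labels.

```python
def run_workflow(actions):
    """Run actions strictly in order; stop at the first failure.

    `actions` is a list of (name, callable) pairs; each callable returns True on success.
    Returns the names that completed and the name that failed (or None).
    """
    completed = []
    for name, action in actions:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


def coordinator_tick(data_available, actions):
    """Start the workflow only if the input data has arrived; otherwise keep waiting."""
    if not data_available():
        return "WAITING"
    _, failed = run_workflow(actions)
    return "FAILED" if failed else "SUCCEEDED"


if __name__ == "__main__":
    steps = [("ingest", lambda: True), ("transform", lambda: True)]
    print(coordinator_tick(lambda: False, steps))  # data has not arrived yet
    print(coordinator_tick(lambda: True, steps))   # data arrived, workflow runs
```

Here `run_workflow` corresponds to the workflow job and `coordinator_tick` to one check made by the coordinator.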
Ambari is a project of the Apache Software Foundation that makes the Hadoop ecosystem more manageable. It covers provisioning, managing, and monitoring of Hadoop clusters. It is a web-based tool and supports HDFS, MapReduce, HCatalog, HBase, Hive, Oozie, ZooKeeper, and Pig. The Ambari wizard is very helpful: it provides a step-by-step set of instructions to install Hadoop ecosystem services, and a metrics and alert framework monitors the health of Hadoop clusters. The main services of Ambari are the following:
- Provisioning: A step-by-step process to install the Hadoop ecosystem across the Hadoop clusters and to handle the configuration of the Hadoop services.
- Cluster Management: A centrally managed service used to start, stop, and reconfigure Hadoop services across the cluster.
- Cluster Monitoring: A dashboard is used to monitor the health status of the clusters. The Ambari Alert Framework sends a notification when the user's attention is needed, e.g. when a node goes down or disk space runs low on a node.
At Last – To Summarize
Hadoop is a successful ecosystem, and the credit goes to its developer community. Many big companies like Google, Yahoo, Facebook, etc. use Hadoop and have extended its capabilities as well. When learning Hadoop, knowledge of just one or two tools may not be sufficient. It is important to learn all the Hadoop components so that a complete solution can be built. The Hadoop ecosystem involves a number of tools, and new tools are being developed by Hadoop experts day by day. It has become an integral part of organizations that are involved in large-scale data processing.
Apache Hive Internal and External Tables
Hive is an open source data warehouse system used for querying and analyzing large datasets. Data in Apache Hive can be categorized into Table, Partition, and Bucket. The table in Hive is logically made up of the data being stored. Hive has two types of tables which are as follows:
- Managed Table (Internal Table)
- External Table
Hive Managed Tables-
It is also known as an internal table. When we create a table in Hive, Hive manages the data by default. This means that Hive moves the data into its warehouse directory.
Hive External Tables-
We can also create an external table. It tells Hive to refer to the data that is at an existing location outside the warehouse directory.
Let’s now discuss the Hive Internal tables vs External tables comparison.
3. Feature-wise Differences between Hive Internal Tables and External Tables
Here we are going to cover the comparison between Hive Internal tables vs External tables on the basis of different features. Let’s discuss them one by one-
3.1. LOAD and DROP Semantics
The main difference between the two table types shows up in the LOAD and DROP semantics.
- Managed Tables – When we load data into a managed table, Hive moves the data into the Hive warehouse directory.
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
This moves the file hdfs://user/tom/data.txt into Hive’s warehouse directory for the managed_table table, which is hdfs://user/hive/warehouse/managed_table.
Further, if we drop the table using:
DROP TABLE managed_table;
Then Hive deletes the table, including its metadata and its data. The data no longer exists anywhere. This is what it means for Hive to manage the data.
- External Tables – External table behaves differently. In this, we can control the creation and deletion of the data. The location of the external data is specified at the table creation time:
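-- The LOCATION path below is an assumed example; use the directory that holds your data.
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';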
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
Now, with the EXTERNAL keyword, Apache Hive knows that it is not managing the data, so it doesn’t move the data to its warehouse directory. It does not even check whether the external location exists at the time the table is defined. This is a very useful feature, because it means we can create the data lazily after creating the table.
The important thing to notice is that when we drop an external table, Hive will leave the data untouched and only delete the metadata.
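The LOAD and DROP behavior of the two table types can be modeled with a short Python sketch that uses local directories in place of HDFS. This is a toy model and not the Hive metastore API: the class `ToyMetastore`, its methods, and the directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    def __init__(self):
        self.tables = {}  # table name -> (data directory, is_external)

    def create_table(self, name, location, external=False):
        # As described above, an external location is not checked at creation time.
        location = Path(location)
        if not external:
            location.mkdir(parents=True, exist_ok=True)
        self.tables[name] = (location, external)

    def load_data(self, name, source_file):
        # LOAD DATA INPATH moves the file into the table's directory.
        location, _ = self.tables[name]
        location.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source_file), str(location / Path(source_file).name))

    def drop_table(self, name):
        location, external = self.tables.pop(name)  # metadata is always removed
        if not external and location.exists():
            shutil.rmtree(location)  # managed table: the data is deleted too


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store = ToyMetastore()
        for name, external in [("managed_table", False), ("external_table", True)]:
            data = root / f"{name}.txt"
            data.write_text("dummy\n")
            store.create_table(name, root / "tables" / name, external=external)
            store.load_data(name, data)
            store.drop_table(name)
            print(name, "data still exists:", (root / "tables" / name).exists())
```

After the drop, only the external table's directory survives, which is the behavior to keep in mind when several tools share the same files.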
3.2. Security
- Managed Tables – Hive solely controls the security of managed tables. Security needs to be managed within Hive, probably at the schema level (depending on the organization).
- External Tables – The files of these tables are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level.
3.3. When to use Managed and external table
Use Managed table when –
- We want Hive to completely manage the lifecycle of the data and table.
- Data is temporary
Use External table when –
- Data is used outside of Hive. For example, the data files are read and processed by an existing program that does not lock the files.
- We are not creating the table from an existing table (for example, with CREATE TABLE ... AS SELECT).
- We need data to remain in the underlying location even after a DROP TABLE. This may apply if we are pointing multiple schemas at a single data set.
- Hive should not own the data or control its settings, directories, etc., because another program or process will do these things.
Apache Hive Command Line Options
Below are the commonly used Apache Hive command line options:
| Hive Command Line Option | Description |
|---|---|
| -d, --define <key=value> | Variable substitution to apply to Hive commands, e.g. -d A=B or --define A=B |
| -e <quoted-query-string> | Execute SQL from the command line, e.g. -e 'select * from table1' |
| -f <filename> | Execute SQL from a file, e.g. -f '/home/sql_file.sql' |
| -H, --help | Print Hive help information. Usually prints all command line options. |
| -h <hostname> | Connect to the Hive server on a remote host |
| --hiveconf <property=value> | Use value for given property |
| --hivevar <key=value> | Variable substitution to apply to Hive commands, e.g. --hivevar A=B |
| -i <filename> | Initialization SQL file |
| -p <port> | Hive Server port number |
| -S, --silent | Silent mode in interactive shell |
| -v, --verbose | Verbose mode (echo executed SQL to the console) |
Hive Command Line Options Usage Examples
Execute query using hive command line options
$ hive -e 'select * from test'
Execute query using hive command line options in silent mode
$ hive -S -e 'select * from test'
Dump data to the file in silent mode
$ hive -S -e 'select col from tab1' > a.txt
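When these commands are generated from a script, it can help to build the argument list programmatically so that quoting is handled consistently. The Python sketch below is one possible helper, not part of Hive: the function `hive_command` is hypothetical, and it only uses the `-S`, `-e`, `-f`, and `--hivevar` options listed in the table.

```python
import shlex


def hive_command(query=None, script=None, silent=False, hivevars=None):
    """Build an argv list for the hive CLI from a query string or a script file."""
    if (query is None) == (script is None):
        raise ValueError("pass exactly one of query or script")
    argv = ["hive"]
    if silent:
        argv.append("-S")
    for key, value in (hivevars or {}).items():
        argv += ["--hivevar", f"{key}={value}"]
    argv += ["-e", query] if query is not None else ["-f", script]
    return argv


if __name__ == "__main__":
    argv = hive_command(query="select col from tab1", silent=True)
    # shlex.join shows the command as it would be typed in a shell.
    print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run` on a machine where Hive is installed; redirecting its standard output to a file gives the same effect as the `> a.txt` example.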
Hadoop is written with large clusters of computers in mind and is built upon the following assumptions:
- Hardware may fail because of external or technical malfunctions. Since failures are expected and handled, commodity hardware can be used instead of specialized machines.
- Processing will be run in batches, and the emphasis is on high throughput as opposed to low latency.
- Applications which run on HDFS have large sets of data. A typical file in HDFS may be of gigabytes to terabytes in size.
- Applications require a write-once-read-many access model.
- Moving computation is cheaper than moving data.
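The last assumption can be checked with a back-of-the-envelope calculation. The figures in this Python sketch are assumptions chosen for illustration (1 TB of input, a 10 MB program, a 1 Gbit/s link), and the formula ignores latency and protocol overhead.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Time to push size_bytes over a link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


# Assumed figures for illustration only.
DATASET_BYTES = 10**12  # 1 TB of input data
PROGRAM_BYTES = 10**7   # 10 MB of job code
LINK_BPS = 10**9        # 1 Gbit/s network link

if __name__ == "__main__":
    print("move data:   ", transfer_seconds(DATASET_BYTES, LINK_BPS), "s")
    print("move program:", transfer_seconds(PROGRAM_BYTES, LINK_BPS), "s")
```

Under these assumptions, shipping the data takes hours while shipping the program takes a fraction of a second, which is why Hadoop sends the job to the nodes that already hold the data.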
The Hadoop ecosystem is neither a single service nor a programming language; it is a platform used to process large amounts of data. The Hadoop ecosystem uses HDFS and MapReduce for storing and processing the data, and Hive for querying it. The Hadoop ecosystem handles the following three types of data:
- Structured Data – Data with a clear structure that can be stored in tabular form
- Semi-Structured Data – Data with some structure that cannot be stored directly in tabular form
- Unstructured Data – Data without any structure, which cannot be stored in tabular form
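The three categories above can be told apart with a few lines of Python. The sample records are invented for illustration, and `column_sets` is a hypothetical helper that simply checks whether all records share the same fields.

```python
import csv
import io
import json

# Structured: every row has the same columns, so it fits a table directly.
structured = list(csv.DictReader(io.StringIO("id,name,city\n1,Asha,Pune\n2,Tom,Leeds\n")))

# Semi-structured: records describe themselves, but their fields can differ.
semi_structured = [json.loads(s) for s in (
    '{"id": 1, "name": "Asha", "tags": ["admin"]}',
    '{"id": 2, "email": "tom@example.com"}',
)]

# Unstructured: free text with no fields at all.
unstructured = "Cluster node dn3 went down at night and came back after a restart."


def column_sets(records):
    """Return the distinct sets of field names found across the records."""
    return {frozenset(r.keys()) for r in records}


if __name__ == "__main__":
    print(len(column_sets(structured)))       # one shared set of columns
    print(len(column_sets(semi_structured)))  # more than one shape of record
    print(len(unstructured.split()))          # only generic text processing applies
```

Structured records share one set of columns, semi-structured records do not, and unstructured text has no fields to inspect at all.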