While studying Big Data, you will inevitably come across the term “Hadoop” quite frequently. Do you know what this cute yellow elephant is all about?
So, What Exactly is Hadoop?
It is truly said that ‘Necessity is the mother of invention’, and Hadoop is amongst the finest inventions in the world of Big Data! Hadoop had to be developed sooner or later, as there was an acute need for a framework that could handle and process Big Data efficiently.
Technically speaking, Hadoop is an open-source software framework that supports data-intensive distributed applications. It is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google about its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project, built and used by a global community of contributors. Hadoop was created by Doug Cutting and Michael J. Cafarella, and the charming yellow elephant you see is named after Doug’s son’s toy elephant!
Hadoop Ecosystem:
Once you are familiar with what Hadoop is, let’s probe into its ecosystem. The Hadoop ecosystem is nothing but the set of components that makes Hadoop so powerful, among which HDFS and MapReduce are the core components!
1. HDFS:
The Hadoop Distributed File System (HDFS) is a very robust feature of Apache Hadoop. HDFS is designed to store gigantic amounts of data reliably, to transfer that data at high speed between nodes, and to let the system continue working smoothly even if some of the nodes fail. HDFS takes care of splitting data into blocks, allocating those blocks across the cluster, replicating them and serving them back to applications. In fact, HDFS manages around 40 petabytes of data at Yahoo! The key components of HDFS are the NameNode, the DataNodes and the Secondary NameNode.
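To make the storage idea concrete, here is a minimal toy sketch (not real HDFS code, and not Hadoop’s API) of the two tricks described above: chopping a file into fixed-size blocks and replicating each block across several DataNodes so the data survives a node failure. The block size, node names and round-robin placement are simplifying assumptions for illustration only.

```python
# Toy model of HDFS block splitting and replication -- NOT real HDFS code.
BLOCK_SIZE = 8    # real HDFS uses 128 MB blocks; 8 bytes keeps the demo small
REPLICATION = 3   # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Chop a file's contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop distributed file system")
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
# Every block now lives on 3 different nodes, so losing any single
# node still leaves 2 copies of each block available.
```

The real NameNode keeps exactly this kind of block-to-DataNode mapping in memory, which is why it is the critical metadata component listed above.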
2. MapReduce:
It all started with Google applying concepts of functional programming to solve the problem of managing large amounts of data on the internet. Google named this system ‘MapReduce’ and described it in a paper it published in 2004. Yahoo then stepped in to develop Hadoop as an open-source implementation of the MapReduce technique. MapReduce is what allowed Google to search and index its huge quantity of web pages in a matter of seconds, or even a fraction of a second. The key components of MapReduce are the JobTracker, the TaskTrackers and the JobHistoryServer.
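The programming model itself is simple enough to sketch on one machine. Below is a minimal single-process simulation of the model (not Hadoop’s actual Java API): a map phase emits (key, value) pairs, a shuffle phase groups values by key, and a reduce phase folds each group into a result — here, the classic word count.

```python
# Single-machine simulation of the MapReduce model -- not Hadoop itself.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big data tools"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 3, "data": 2, "ideas": 1, "tools": 1}
```

In real Hadoop, the map and reduce functions run in parallel across many TaskTrackers, and the shuffle happens over the network — but the data flow is exactly this.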
3. Apache Pig:
Apache Pig is another component of Hadoop: a platform for analyzing huge data sets using a high-level language. In fact, Pig was initiated with the idea of making it easy to create and execute commands over Big Data sets. The basic attribute of Pig programs is ‘parallelization’, which helps them manage large data sets. Apache Pig consists of a compiler that generates series of MapReduce programs, and a language layer called ‘Pig Latin’ that lets SQL-like queries be run on distributed data in Hadoop.
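To give a feel for what Pig compiles away for you, consider a hypothetical Pig Latin script (the relation and field names here are made up for illustration) and a rough Python equivalent of what that pipeline computes — the point being that Pig turns the three declarative lines into MapReduce jobs automatically:

```python
# A hypothetical Pig Latin script like:
#   logs    = LOAD 'visits' AS (user, url);
#   grouped = GROUP logs BY url;
#   hits    = FOREACH grouped GENERATE group, COUNT(logs);
# compiles down to MapReduce jobs. A rough Python sketch of the same
# computation (this is NOT Pig, just an illustration of its semantics):
from collections import Counter

visits = [("ann", "/home"), ("bob", "/home"), ("ann", "/docs")]
hits = Counter(url for _user, url in visits)   # GROUP BY url + COUNT
# hits["/home"] == 2, hits["/docs"] == 1
```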
4. Apache Hive:
As the name suggests, Hive is Hadoop’s data warehouse system. It enables quick data summarization for Hadoop, handles queries and evaluates huge data sets located in Hadoop’s file systems, while maintaining full support for map/reduce. Another striking feature of Apache Hive is that it can provide indexes, such as bitmap indexes, to speed up queries. Apache Hive was originally developed by Facebook, but it is now developed and used by other companies too, including Netflix.
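The bitmap index mentioned above is worth a quick sketch. The toy model below is not Hive’s implementation — just the underlying idea: for a low-cardinality column, store one bit per row for each distinct value, so a filter becomes a cheap bitwise lookup instead of a scan.

```python
# Toy bitmap index -- illustrates the concept, not Hive's internals.
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a bitmask over row ids."""
    index = {}
    for row_id, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def rows_matching(index, value):
    """Decode the bitmask back into the list of matching row ids."""
    mask = index.get(value, 0)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

rows = [{"country": "US"}, {"country": "IN"}, {"country": "US"}]
idx = build_bitmap_index(rows, "country")
# rows_matching(idx, "US") == [0, 2]
```

Because the bitmasks are tiny compared to the rows themselves, queries like `WHERE country = 'US'` can be answered from the index alone.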
5. Apache HCatalog:
Apache HCatalog is another important component of Apache Hadoop, providing a table and storage management service for data created with Apache Hadoop. HCatalog offers features like a shared schema and data type mechanism, a table abstraction for users, and smooth interoperability with other Hadoop components such as Pig, MapReduce, Streaming and Hive.
6. Apache HBase:
HBase is short for Hadoop DataBase. HBase is a distributed, column-oriented database that uses HDFS for storage. On one hand it supports batch-style computations using MapReduce, and on the other it handles point queries (random reads). The key components of Apache HBase are the HBase Master and the RegionServers.
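HBase’s data model can be sketched in a few lines. The toy store below is not the HBase client API — just the shape of the model: a map from row key to “columnfamily:qualifier” cells, which is what makes point queries by row key cheap. The row keys and column names are invented for the example.

```python
# Toy model of HBase's data model -- NOT the real HBase client API.
table = {}  # row key -> {"family:qualifier": value}

def put(row, column, value):
    """Write one cell, addressed by row key and column."""
    table.setdefault(row, {})[column] = value

def get(row, column):
    """Point query: a random read of one cell by row key and column."""
    return table.get(row, {}).get(column)

put("user1", "info:name", "Ann")
put("user1", "info:city", "Pune")
put("user2", "info:name", "Bob")
# get("user1", "info:name") == "Ann"
```

In real HBase the rows are kept sorted by key and split into regions served by the RegionServers, but reads and writes are addressed in exactly this row/column-family style.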
7. Apache ZooKeeper:
Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to keep a record of configuration information and naming, and to provide the distributed synchronization and group services that are immensely crucial for various distributed systems. In fact, HBase depends upon ZooKeeper for its functioning.
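ZooKeeper’s core abstraction is simple to sketch: a tree of “znodes”, each holding a small blob of data, that distributed systems use for shared configuration and naming. The toy version below is not the real ZooKeeper client API, and the paths and address are invented — it only illustrates the znode-tree idea, including the rule that a znode’s parent must exist.

```python
# Toy znode tree -- illustrates ZooKeeper's data model, NOT its client API.
znodes = {}  # path -> data, e.g. "/hbase/master" -> "node7:16000"

def create(path, data):
    """Create a znode; like real ZooKeeper, the parent path must exist."""
    parent = path.rsplit("/", 1)[0] or "/"
    if parent != "/" and parent not in znodes:
        raise ValueError(f"parent {parent} does not exist")
    znodes[path] = data

def get_data(path):
    """Read the data stored at a znode, or None if it does not exist."""
    return znodes.get(path)

create("/hbase", "")
create("/hbase/master", "node7:16000")  # e.g. HBase publishing its master's address
# get_data("/hbase/master") == "node7:16000"
```

This is the pattern behind the HBase dependency mentioned above: the cluster agrees on “who is the master” by reading a well-known znode rather than any single server.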
Together, all these components make Hadoop a real solution to the challenges of Big Data!