Saturday, July 27, 2013
What is Hadoop all about?
While studying Big Data, you will inevitably come across the term "Hadoop" quite frequently. But do you know what this cute yellow elephant is all about?
So, what exactly is Hadoop?
It is truly said that 'Necessity is the mother of invention', and Hadoop is among the finest inventions in the world of Big Data! Hadoop had to be developed sooner or later, as there was an acute need for a framework that could handle and process Big Data efficiently.
Technically speaking, Hadoop is an open-source software framework that supports data-intensive distributed applications. It is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally published by Google on its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project, built and used by a global community of contributors. Hadoop was created by Doug Cutting and Michael J. Cafarella, and the charming yellow elephant you see is actually named after Doug's son's toy elephant!
Hadoop Ecosystem:
Once you are familiar with 'What is Hadoop', let's probe into its ecosystem. The Hadoop ecosystem is the set of components that make Hadoop so powerful, among which HDFS and MapReduce are the core ones!
1. HDFS:
The Hadoop Distributed File System (HDFS) is a very robust feature of Apache Hadoop. HDFS is designed to store gigantic amounts of data reliably, to transfer data at high speed among nodes, and to let the system keep working smoothly even if some of the nodes fail. It does this by splitting files into large blocks and replicating each block across several machines. In fact, HDFS manages around 40 petabytes of data at Yahoo! The key components of HDFS are the NameNode, the DataNodes and the Secondary NameNode.
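The block-and-replica idea at the heart of HDFS can be sketched in a few lines of plain Python. This is a toy model, not the real HDFS API; the block size, replication factor and node names are illustrative (HDFS defaults are far larger, e.g. 128 MB blocks and 3 replicas):

```python
# Toy sketch of the core HDFS idea: files are split into fixed-size
# blocks, each block is replicated onto several DataNodes, and the
# NameNode keeps the map of which DataNodes hold which block.

BLOCK_SIZE = 4    # bytes per block (real HDFS default: 128 MB)
REPLICATION = 3   # copies of each block (real HDFS default: 3)

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, like an HDFS client."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication: int = REPLICATION):
    """Build a NameNode-style map: block id -> DataNodes holding a replica."""
    placement = {}
    for block_id in range(len(blocks)):
        # simple round-robin placement; real HDFS placement is rack-aware
        placement[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                               for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop!")
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on several DataNodes, losing any one machine never loses data, which is what lets the cluster "continue working smoothly even if nodes fail".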
2. MapReduce:
It all started with Google applying the concepts of functional programming to solve the problem of managing large amounts of data on the web. Google named the system 'MapReduce' and described it in a paper it published in 2004. MapReduce let Google search and index its huge quantity of web pages in a matter of seconds, or even a fraction of a second. With the ever-increasing amount of data generated on the web, Yahoo stepped in to back the development of Hadoop, which implements the MapReduce technique as open source. The key components of MapReduce are the JobTracker, the TaskTrackers and the JobHistoryServer.
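The model itself is simple enough to sketch in plain Python. This is the classic word-count example, a minimal illustration of the map / shuffle / reduce phases, not the Hadoop Java API:

```python
from collections import defaultdict

# Minimal word-count sketch of the MapReduce model: mappers emit
# (key, value) pairs, the framework shuffles pairs by key, and
# reducers aggregate each group of values.

def map_phase(line):
    """Mapper: emit (word, 1) for every word in one line of input."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts emitted for one word."""
    return (key, sum(values))

lines = ["Hadoop handles big data", "big data needs Hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The point of the model is that the map calls (and the reduce calls) are independent of one another, so Hadoop can run them in parallel across thousands of machines.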
3. Apache Pig:
Apache Pig is another component of Hadoop, used to analyze huge data sets by describing them in a high-level language. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic attribute of Pig programs is 'parallelization', which helps them manage large data sets. Apache Pig consists of a compiler that generates a series of MapReduce programs and a language layer called 'Pig Latin' that lets SQL-like queries be run over distributed data in Hadoop.
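To give a feel for the kind of dataflow a Pig Latin script expresses, here is a tiny load / group / count pipeline emulated in plain Python. The Pig Latin shown in the comments is an illustrative sketch (the relation and field names are made up), and the Python below it is only a sequential stand-in for what Pig would compile into parallel MapReduce jobs:

```python
from itertools import groupby

# Rough Python equivalent of a tiny (hypothetical) Pig Latin script:
#   logs    = LOAD 'visits' AS (user, url);
#   grouped = GROUP logs BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(logs);

visits = [("ann", "/home"), ("bob", "/shop"), ("ann", "/cart")]

def group_by_user(records):
    """GROUP ... BY user: collect each user's records together."""
    key = lambda rec: rec[0]
    return {user: list(recs)
            for user, recs in groupby(sorted(records, key=key), key)}

# FOREACH ... GENERATE group, COUNT(...): one (user, count) row per group
visit_counts = {user: len(recs)
                for user, recs in group_by_user(visits).items()}
```

Each step in a Pig script produces a new relation from the previous one, which is exactly the shape that parallelizes well over MapReduce.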
4. Apache Hive:
As the name suggests, Hive is Hadoop's data warehouse system. It enables quick data summarization, handles queries and analyzes huge data sets located in Hadoop's file systems, while maintaining full support for MapReduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, to speed up queries. Apache Hive was originally developed by Facebook, but it is now developed and used by other companies too, including Netflix.
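Hive's query language, HiveQL, looks much like ordinary SQL. Purely to give a feel for the kind of summarization query Hive runs over data in HDFS, here is an equivalent query executed against SQLite from Python; SQLite is only a stand-in engine, and the table and data are made up:

```python
import sqlite3

# A Hive-style summarization ("views per page") expressed in SQL.
# SQLite stands in for Hive here; in real Hive the same GROUP BY
# query would be compiled into MapReduce jobs over HDFS files.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("ann", "/home"), ("ann", "/cart"), ("bob", "/home")])

rows = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
```

The appeal of Hive is exactly this: analysts can write familiar SQL-like queries without writing MapReduce code by hand.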
5. Apache HCatalog:
Apache HCatalog is another important component of Apache Hadoop, providing a table and storage management service for data created with Apache Hadoop. HCatalog offers features like a shared schema and data type mechanism, a table abstraction for users, and smooth interoperation with other components of Hadoop such as Pig, MapReduce, Streaming and Hive.
6. Apache HBase:
HBase stands for Hadoop DataBase. HBase is a distributed, column-oriented database that uses HDFS for storage. On one hand it supports batch-style computations using MapReduce, and on the other it handles point queries (random reads). The key components of Apache HBase are the HBase Master and the RegionServers.
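HBase's data model can be pictured as a nested map: row key, then column family, then column, then value. The sketch below is a toy model of that shape, not the real HBase client API, and the table contents are invented for illustration:

```python
# Toy model of HBase's data model:
#   row key -> column family -> column -> value

table = {
    "row1": {"info": {"name": "Ada", "city": "London"}},
    "row2": {"info": {"name": "Bob", "city": "Pune"}},
}

def get(table, row_key, family, column):
    """Point query (random read): fetch one cell directly by row key."""
    return table.get(row_key, {}).get(family, {}).get(column)

def scan(table, family, column):
    """Scan: stream one column across all rows, batch-style access."""
    return [(row, cells.get(family, {}).get(column))
            for row, cells in sorted(table.items())]
```

The two access paths mirror the sentence above: `get` is the fast point query, while `scan` is the shape of access that batch MapReduce jobs use.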
7. Apache ZooKeeper:
Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to keep a record of configuration information and naming, and to provide distributed synchronization and group services, all of which are immensely crucial for various distributed systems. In fact, HBase depends on ZooKeeper for its functioning.
All these components make Hadoop a real solution to face the challenges of Big Data!
The Hype Behind BIG DATA
Big Data!
We come across data in every possible form, whether through social media sites, sensor networks, digital images or videos, cell phone GPS signals, purchase transaction records, web logs, medical records, archives, military surveillance, e-commerce, complex scientific research and so on… it amounts to some quintillions of bytes of data! This data is what we call… BIG DATA!
Big Data is an assortment of data so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it with on-hand database management tools or traditional data processing techniques. In fact, the notion of "BIG DATA" varies from company to company depending on its size, capacity, competence, human resources, techniques and so on. For some companies it may be a cumbersome job to manage a few gigabytes, while for others it may be some terabytes creating a hassle across the entire organization.
Big Data is
characterized by: Volume, Velocity and Variety!
1. Volume: Whether data counts as "big" depends on how gigantic it is. It could amount to hundreds of terabytes or even petabytes of information. For instance, 15 terabytes of Facebook posts or 400 billion annual medical records could mean Big Data!
2. Velocity: Velocity means the rate at which data flows into companies. Big Data requires fast processing, and the time factor plays a crucial role in several organizations. For instance, processing 2 million records at the share market, or evaluating the results of lakhs of students who applied for competitive exams, could mean Big Data!
3. Variety: Big Data may not belong to a specific format. It could be in any form: structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D models, etc. Research shows that a substantial amount of an organization's data is not numeric; however, such data is equally important to the decision-making process. So organizations need to think beyond stock records, documents, personnel files, finances, etc.
Big Data Opportunities
Why is it important to harness Big Data?
Data has never been as crucial as it is today! In fact, we can see a transition from the old saying 'Customer is king' to 'Data is king'! This is because, for efficient decision making, it is very important to analyze the right amount and the right type of data. Companies, whether in healthcare, banking, the public sector, pharmaceuticals or IT, all need to look beyond the concrete data stored in their databases and study the intangible data coming from sensors, images, weblogs, etc. In fact, what sets smart organizations apart from others is their ability to scan data effectively to allocate resources properly, increase productivity and inspire innovation!
Some reasons why Big Data analysis is crucial:
1. Just like labor and capital, data has become one of the factors of production in almost all industries.
2. Big Data can unveil really useful and crucial information that can change the decision-making process entirely into a more fruitful one.
3. Big Data makes customer segmentation easier and more visible, enabling companies to focus on their more profitable and loyal customers.
4. Big Data can be an important criterion for deciding on the next line of products and services that future customers will require. Thus, companies can follow a proactive approach at every step.
5. The way in which Big Data is explored and used can directly impact the growth and development of an organization and help it give tough competition to others in the race!
Data-driven strategies are fast becoming the latest trend at the management level!
How to Harness Big Data?
As the name suggests, it is not an easy task to capture, store, process and analyze Big Data. Optimizing Big Data is a daunting affair that requires a robust infrastructure and state-of-the-art technology, which should also take care of the privacy, security, intellectual property and even liability issues related to Big Data. Big Data will help you answer questions that have been lingering for a long time! It is not the amount of data that matters most; it is what you are able to do with it that draws the line between the achievers and the losers.
Some Recent Technologies:
Companies are relying on the following technologies for Big Data analysis:
• Speedy and efficient processors
• Modern storage and processing technologies, especially for unstructured data
• Robust server processing capacities
• Cloud computing
• Clustering, high connectivity, parallel processing, MPP
• Apache Hadoop / Hadoop Big Data