Saturday, July 27, 2013

Why Hadoop?


What is Hadoop all about?

While studying Big Data, you will inevitably come across the term “Hadoop” quite frequently. Do you know what this cute yellow elephant is all about?

So, What exactly is Hadoop?
It is truly said that ‘Necessity is the mother of all inventions’, and ‘Hadoop’ is amongst the finest inventions in the world of Big Data! Hadoop had to be developed sooner or later, as there was an acute need for a framework that could handle and process Big Data efficiently.
Technically speaking, Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally published by Google on its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project, built and used by a global community of contributors. Hadoop was created by Doug Cutting and Michael J. Cafarella. And the charming yellow elephant you see is named after Doug’s son’s toy elephant!
Hadoop Ecosystem:
Once you are familiar with what Hadoop is, let’s probe into its ecosystem. The Hadoop Ecosystem is nothing but the various components that make Hadoop so powerful, among which HDFS and MapReduce are the core components!
1. HDFS:
The Hadoop Distributed File System (HDFS) is a very robust feature of Apache Hadoop. HDFS is designed to store gigantic amounts of data reliably and to transfer data at high speed among nodes, allowing the system to keep working smoothly even if some of the nodes fail. It does this by splitting files into large blocks and replicating each block across several machines. In fact, HDFS manages around 40 petabytes of data at Yahoo! The key components of HDFS are the NameNode, the DataNodes and the Secondary NameNode.
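To make the block-and-replica idea concrete, here is a tiny Python sketch (real HDFS is written in Java and uses 64/128 MB blocks; the block size, node names and placement rule below are purely illustrative):

```python
# Toy model of HDFS storage: split a file into fixed-size blocks, then
# record, NameNode-style, which DataNodes hold a replica of each block.
BLOCK_SIZE = 4      # real HDFS uses 64/128 MB; tiny here for the demo
REPLICATION = 3     # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chop the file contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Map each block id to the list of DataNodes holding a copy of it."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks("abcdefghij")
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
print(blocks)        # ['abcd', 'efgh', 'ij']
print(placement[0])  # ['node1', 'node2', 'node3']
```

Because every block lives on three different nodes, the failure of any single DataNode never makes a block unreadable, which is exactly the fault tolerance described above.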

2. MapReduce:
It all started with Google applying concepts of functional programming to solve the problem of managing the huge amounts of data on the internet. Google named this system ‘MapReduce’ and described it in a paper published in 2004. MapReduce helped Google search and index its enormous collection of web pages in a matter of seconds, or even a fraction of a second. With the ever-increasing amount of data generated on the web, Yahoo stepped in to develop Hadoop as an open-source implementation of the MapReduce technique. The key components of MapReduce are the JobTracker, the TaskTrackers and the JobHistoryServer.
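The essence of MapReduce is just two functions: a *map* that turns each input record into key/value pairs, and a *reduce* that combines all values sharing a key. A minimal single-machine sketch in Python (Hadoop itself runs this in Java across many nodes; the word-count task and sample documents are only for illustration):

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce step: group the pairs by key and sum the counts."""
    groups = defaultdict(int)
    for word, count in pairs:
        groups[word] += count
    return dict(groups)

docs = ["Hadoop stores Big Data", "Big Data needs Hadoop"]
counts = reduce_phase(map_phase(docs))
print(counts["hadoop"])  # 2
```

In real Hadoop, the map calls run in parallel on the nodes that already hold the data blocks, and the framework shuffles the intermediate pairs to the reducers, which is what makes the technique scale to web-sized inputs.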

3. Apache Pig:
Apache Pig is another component of Hadoop, used to analyze huge data sets with a high-level language. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic attribute of Pig programs is ‘parallelization’, which helps them manage large data sets. Apache Pig consists of a compiler that generates a series of MapReduce programs and a language layer called ‘Pig Latin’, which allows SQL-like queries to be run on distributed data in Hadoop.
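A typical Pig Latin script loads records, groups them and aggregates each group, and the compiler turns that into MapReduce jobs. Here is the same load/group/count dataflow sketched in plain Python (the log records and field names are made up for illustration):

```python
# Dataflow equivalent of a Pig-style "GROUP BY user, then COUNT" script,
# run on a toy in-memory relation of (user, url) records.
from itertools import groupby

logs = [("alice", "/home"), ("bob", "/shop"), ("alice", "/cart")]

# groupby needs its input sorted by the grouping key, much as the
# MapReduce shuffle sorts intermediate pairs by key.
grouped = groupby(sorted(logs), key=lambda rec: rec[0])
hits = {user: len(list(records)) for user, records in grouped}
print(hits)  # {'alice': 2, 'bob': 1}
```

The point of Pig is that you write only the short declarative script; the parallel execution across the cluster is generated for you.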

4. Apache Hive:
As the name suggests, Hive is Hadoop’s data warehouse system. It enables quick data summarization, handles ad hoc queries, and analyzes huge data sets located in Hadoop’s file systems, while maintaining full support for map/reduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, to speed up queries. Apache Hive was originally developed by Facebook, but it is now developed and used by other companies too, including Netflix.
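Hive exposes a SQL-like language (HiveQL) and translates each query into MapReduce jobs over files in HDFS. To show the flavour of such a summarization query without a cluster, the sketch below runs an equivalent GROUP BY through Python's built-in sqlite3; the table and column names are invented for the example, and sqlite stands in only as a familiar SQL engine:

```python
import sqlite3

# An in-memory table playing the role of a Hive table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("alice", "/home"), ("alice", "/cart"), ("bob", "/home")])

# The kind of summarization Hive performs, expressed declaratively in SQL.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('/cart', 1), ('/home', 2)]
```

On Hive, the same statement would be compiled into map and reduce stages rather than executed by a single-node SQL engine.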

5. Apache HCatalog
Apache HCatalog is another important component of Apache Hadoop, which provides a table and storage management service for data created with the help of Apache Hadoop. HCatalog offers features like a shared schema and data type mechanism, a table abstraction for users, and smooth interoperation with other components of Hadoop such as Pig, MapReduce, Streaming, and Hive.

6. Apache HBase
HBase stands for Hadoop Database. HBase is a distributed, column-oriented database that uses HDFS for storage. On one hand it supports batch-style computations using MapReduce, and on the other it handles point queries (random reads). The key components of Apache HBase are the HBase Master and the RegionServers.
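HBase's data model addresses every cell by a row key plus a (column family, qualifier) pair, which is what makes single-row point reads fast. A minimal in-memory sketch of that model (the class, table and cell names below are illustrative, not HBase's API):

```python
# Toy model of HBase's storage layout: rows keyed by a row key, with each
# cell addressed by (column family, qualifier).
class TinyHBaseTable:
    def __init__(self):
        self.rows = {}  # row_key -> {(family, qualifier): value}

    def put(self, row_key, family, qualifier, value):
        """Write a single cell in the given row."""
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value

    def get(self, row_key, family, qualifier):
        """Point query: random read of one cell by row key."""
        return self.rows.get(row_key, {}).get((family, qualifier))

users = TinyHBaseTable()
users.put("row1", "info", "name", "Alice")
print(users.get("row1", "info", "name"))  # Alice
```

In the real system, the row-key space is split into regions served by the RegionServers, and the data itself is persisted as files on HDFS.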

7. Apache Zookeeper
Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to maintain configuration information, provide naming, distributed synchronization, and group services, all of which are immensely crucial for various distributed systems. In fact, HBase depends upon ZooKeeper for its functioning.
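ZooKeeper organizes the shared configuration it stores as a tree of small data nodes called “znodes”, addressed by filesystem-like paths. A rough in-memory sketch of that namespace (the class and the paths are illustrative, not ZooKeeper's real client API):

```python
# Toy model of ZooKeeper's namespace: a tree of znodes, each holding a
# small blob of data that every machine in the cluster can read.
class TinyZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}  # root znode always exists

    def create(self, path, data):
        """Create a znode; its parent must already exist, as in ZooKeeper."""
        parent = path.rsplit("/", 1)[0] or "/"
        assert parent in self.nodes, "parent znode must exist"
        self.nodes[path] = data

    def get(self, path):
        """Read the data stored at a znode."""
        return self.nodes[path]

zk = TinyZNodeTree()
zk.create("/config", b"")
zk.create("/config/db_url", b"hdfs://namenode:8020")
print(zk.get("/config/db_url"))
```

The real service adds what this sketch omits: replication of the tree across an ensemble of servers, and watches that notify clients the moment a znode changes, which is how it provides synchronization and group membership.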
All these components make Hadoop a real solution to face the challenges of Big Data!

The Hype Behind BIG DATA

Big Data!
We come across data in every possible form, whether through social media sites, sensor networks, digital images or videos, cell phone GPS signals, purchase transaction records, web logs, medical records, archives, military surveillance, e-commerce, complex scientific research and so on… it adds up to quintillions of bytes of data! This data is what we call… BIG DATA!
Big Data is nothing but a collection of data so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it with on-hand database management tools or traditional data processing techniques. In fact, the notion of “BIG DATA” may vary from company to company, depending upon its size, capacity, competence, human resources, techniques and so on. For some companies, managing a few gigabytes may be a cumbersome job, while for others it may take terabytes to create a hassle across the entire organization.

 Big Data is characterized by: Volume, Velocity and Variety!

1. Volume: Volume refers to how gigantic the data is. It could amount to hundreds of terabytes or even petabytes of information. For instance, 15 terabytes of Facebook posts or 400 billion annual medical records could mean Big Data!
2. Velocity: Velocity means the rate at which data flows into companies. Big Data requires fast processing, and the time factor plays a very crucial role in several organizations. For instance, processing 2 million records at the share market or evaluating the results of lakhs of students who applied for competitive exams could mean Big Data!
3. Variety: Big Data may not belong to a specific format. It could be in any form, such as structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D models, etc. New research shows that a substantial amount of an organization’s data is not numeric; however, such data is equally important for the decision-making process. So, organizations need to think beyond stock records, documents, personnel files, finances, etc.

Big Data Opportunities
Why is it important to harness Big Data?
Data has never been as crucial as it is today! In fact, we can see a transition from the old saying ‘Customer is King’ to ‘Data is King’! This is because, for efficient decision making, it is very important to analyze the right amount and the right type of data! Companies, whether in healthcare, banking, the public sector, pharmaceuticals, or IT, all need to look beyond the concrete data stored in their databases and study the intangible data coming from sensors, images, weblogs, etc. In fact, what sets smart organizations apart from others is their ability to scan data effectively to allocate resources properly, increase productivity and inspire innovation!

Some reasons why Big Data analysis is crucial:
1. Just like labor and capital, data has become one of the factors of production in almost all industries.
2. Big Data can unveil really useful and crucial information that can transform the decision-making process into a far more fruitful one.
3. Big Data makes customer segmentation easier and more visible, enabling companies to focus on the more profitable and loyal customers.
4. Big Data can be an important criterion for deciding upon the next line of products and services required by future customers. Thus, companies can follow a proactive approach at every step.
5. The way in which Big Data is explored and used can directly impact the growth and development of an organization and give tough competition to others in the row! Data-driven strategies are fast becoming the latest trend at the management level!
How to Harness Big Data?
As the name suggests, it is not an easy task to capture, store, process and analyze Big Data. Optimizing Big Data is a daunting affair that requires a robust infrastructure and state-of-the-art technology, which should also take care of the privacy, security, intellectual property, and even liability issues related to Big Data. Big Data will help you answer questions that have been lingering for a long time! It is not the amount of Big Data that matters most; it is what you are able to do with it that draws the line between the achievers and the losers.
Some Recent Technologies:
Companies are relying on the following technologies for Big Data analysis:
•   Speedy and efficient processors
•   Modern storage and processing technologies, especially for unstructured data
•   Robust server processing capacities
•   Cloud computing
•   Clustering, high connectivity, parallel processing, MPP
•   Apache Hadoop / Hadoop Big Data