What is Hadoop? Interesting Information about Big Data and Hadoop



Today, I want to share some helpful information about how Hadoop is used in the industry.
What is Big Data and what is the Big Deal?

This is what happened on Facebook in the last 20 minutes - 1 million links shared, 1.5 million event invites, 1.9 million friend requests, 2.7 million photos uploaded, 2.8 million messages sent, 1 million tags, 1.5 million status updates and 2.8 million comments.

Hundreds of companies like Facebook are exposed to LARGE amounts of data like this. This situation, where companies and institutions have to support, store, analyze and make decisions using large amounts of data, is called Big Data.

It's a Big Deal because, using Big Data, one can build better products, offer better services and predict the future better. All this means Big Money. So Big Data is a Big Deal!



What is Hadoop? Why is a funny-looking elephant the logo?
Hadoop is an open source software project. It is used chiefly in situations where there are LARGE amounts of data to analyze - structured and unstructured data.

For example, banks need to draw patterns from millions of transactions, retail stores need to craft promotions based on millions of purchases, social networks need to analyze billions of events, ad networks need to analyze billions of clicks and so on.

Traditional software frameworks are incapable of efficiently handling such large volumes of data using conventional hardware. The Hadoop project provides a set of tools such as MapReduce, Hive, Pig etc. that offer developers the flexibility to perform operations on large amounts of data using normal hardware.

The data analysis jobs are split up across various computers and processed in parallel using Hadoop.
"Hadoop" was the name of a toy elephant belonging to the daughter of Doug Cutting, the creator of Hadoop, so Doug decided to name his software project after his daughter's toy.


Who invented Hadoop?
"All roads lead to Rome" - Google invented the basic frameworks that constitute what is today popularly called as Hadoop. They faced the future first with the problem of handling billions of searches and indexing millions of web pages. When they could not find any large scale, distributed, scalable computing platforms for their needs, they just went ahead and created their own.

Doug Cutting was inspired by Google's white papers and decided to create an open source project called "Hadoop".

Yahoo further contributed to this project and played a key role in developing Hadoop for enterprise applications. Since then, many companies such as Facebook, LinkedIn, eBay, Hortonworks, Cloudera etc. have contributed to the Hadoop project.


So why should I care?
"For every 100 Big Data jobs there are only 2 qualified candidates" - Fastcompany

"By 2018, the United States will create 290,000 to 340,000 new big data jobs and more than half could go unfilled because skilled candidates are in short supply." - McKinsey

Convinced? Not yet? Here's more:

"In a survey of 3,000 global companies, more than 83% of respondents identified business analytics from big data as a top priority" - IBM

"With Big Data skills, there is an opportunity for students to gain a market advantage while starting their career" - Terri Griffith, Professor, Santa Clara University



How does Facebook use Hadoop?

Facebook claims to have the largest Hadoop cluster, with 4,000 machines that can handle 100 petabytes of data - 1 petabyte is equal to 250 billion pages of text.
Every day, people share 2.5 billion items on Facebook, which include status updates, wall posts, photos, videos and comments.

Using the underlying framework of Hadoop, Facebook can process trillions of connections between people, places and things in minutes.


What is Apache Hive?
It is an open source data warehouse framework developed by Facebook. It is primarily used for data analysis. Using Hive, you can query the data in Hadoop using an SQL-like language called HiveQL.
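
To give a flavor of HiveQL, here is a minimal sketch of running a Hive query from Java over JDBC. The HiveServer2 address and the ad_click_log table are assumptions made up for this example, not details of any real deployment:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TopAdsByClicks {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; the host, port and table below are made up for this sketch.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://localhost:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL reads like SQL, but Hive compiles it into jobs that run on the Hadoop cluster.
                ResultSet rs = stmt.executeQuery(
                        "SELECT ad_id, COUNT(*) AS clicks "
                      + "FROM ad_click_log "
                      + "GROUP BY ad_id "
                      + "ORDER BY clicks DESC "
                      + "LIMIT 10");

                while (rs.next()) {
                    System.out.println(rs.getString("ad_id") + "\t" + rs.getLong("clicks"));
                }
            }
        }
    }

Hive turns a query like this into jobs that run across the cluster, which is what makes SQL-style reporting over huge data sets practical.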


How does Facebook use Hive?
Facebook uses Hive to analyze and summarize data in the Facebook Ad Network. Using Hive, advertisers on Facebook are able to view aggregated statistics about their ads, such as the number of clicks, number of impressions, number of unique users, click-through rates etc.

Hive was also used for Facebook's Lexicon product, which reported trends by tracking buzzwords on Facebook walls.

Using Hive, jobs that took days to complete could now be completed in hours.


What is Apache Pig?
It is a platform for analyzing large data sets, built around a high-level procedural language (Pig Latin) that simplifies querying large data sets.


How does Linkedin use Pig?

Pig is a large component of LinkedIn's data cycle. It is used behind features like "People you may know" and "Who viewed my profile".

Another popular use of Pig in any company is log analysis.
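
To give a flavor of what such a log analysis looks like, here is a minimal sketch that drives Pig from Java using its embedded PigServer API. The access_log.txt file and its two tab-separated columns are assumptions made up for this example:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PageHitCounts {
        public static void main(String[] args) throws Exception {
            // Local mode is handy for trying Pig Latin on one machine;
            // ExecType.MAPREDUCE would run the same script on a Hadoop cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // access_log.txt and its (user, page) layout are made up for this sketch.
            pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage('\\t') "
                    + "AS (user:chararray, page:chararray);");
            pig.registerQuery("by_page = GROUP logs BY page;");
            pig.registerQuery("counts = FOREACH by_page GENERATE group AS page, COUNT(logs) AS hits;");

            // Writes one (page, hits) record per page to the page_counts output directory.
            pig.store("counts", "page_counts");
        }
    }

Each registerQuery line is one Pig Latin statement; Pig builds an execution plan from them and only launches jobs when the result is actually stored or dumped.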


What is Apache HBase?
HBase is a large-scale, distributed NoSQL database built on top of the Hadoop Distributed File System (HDFS). It is an Apache open source project.
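
Here is a minimal sketch of writing and reading one row through HBase's Java client API. The "messages" table, the "msg" column family and the row key format are assumptions made up for this example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MessageStore {
        public static void main(String[] args) throws Exception {
            // Reads cluster settings (ZooKeeper quorum etc.) from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("messages"))) {

                String rowKey = "user42#2013-07-01T10:00";

                // Write one message: row key = user id + timestamp, column family "msg".
                Put put = new Put(Bytes.toBytes(rowKey));
                put.addColumn(Bytes.toBytes("msg"), Bytes.toBytes("body"), Bytes.toBytes("Hello!"));
                table.put(put);

                // Read it back by row key.
                Get get = new Get(Bytes.toBytes(rowKey));
                Result result = table.get(get);
                String body = Bytes.toString(result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("body")));
                System.out.println("Stored message: " + body);
            }
        }
    }

Unlike a batch MapReduce job, these are low-latency single-row reads and writes, which is why HBase suits interactive workloads such as messaging.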




How does Facebook use HBase?
The messaging application on Facebook, which includes messages, chat and email, uses HBase. It handles 350 million users sending 15 billion person-to-person messages per month and supports 300 million users sending 120 billion chat messages per month.

Facebook also uses HBase to store harvested data used in Graph Search. With 1 billion new posts added every day, the index contains more than 1 trillion total posts, comprising hundreds of terabytes of data.


How does Twitter use Hadoop, HBase and Pig?

At Twitter, this combination powers features such as "who to follow" recommendations, tailored follow suggestions for new users and "best of Twitter" emails.


Was this helpful to you? Do let me know what other information you want me to mail you.


Mail us: analyticsfolder.nagarjuna@gmail.com
