Today, I want to share some helpful information about how Hadoop is used in the industry.
What is Big Data and what is the Big Deal?
This is what happened on Facebook in the last 20 minutes - 1
million links shared, 1.5 million event invites, 1.9 million friend requests,
2.7 million photos uploaded, 2.8 million messages sent, 1 million tags, 1.5
million status updates and 2.8 million comments.
Hundreds of companies like Facebook have to deal with
LARGE amounts of data like the above. This situation, where companies and
institutions have to support, store, analyze and make decisions using large
amounts of data, is called Big Data.
It's a Big Deal because, using Big Data, one can build better
products, offer better services and predict the future better. All this means
Big Money. So Big Data is a Big Deal!
What is Hadoop? Why is a funny looking elephant the logo?
Hadoop is an open source software project. It is used
chiefly in situations where there are LARGE amounts of data to analyze,
both structured and unstructured.
For example, banks need to draw patterns from millions of
transactions, retail stores need to craft promotions based on millions of
purchases, social networks need to analyze billions of events, and ad networks
need to analyze billions of clicks.
Traditional software frameworks are incapable of efficiently
handling such large volumes of data using conventional hardware. The Hadoop
project provides a set of tools, such as MapReduce, Hive and Pig, that offer
developers the flexibility to perform operations on large amounts of data using
normal hardware.
Hadoop splits data analysis jobs across many computers and
processes the pieces in parallel.
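To make the split-and-process-in-parallel idea concrete, here is a minimal
sketch in plain Python. It is not Hadoop code: threads on one machine stand in
for the computers in a cluster, and the function names (map_chunk,
reduce_counts, word_count) are made up for this illustration. It only shows the
general MapReduce pattern of counting words chunk by chunk and then merging the
partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, workers=3):
    """Split the input into chunks, map them in parallel, then reduce."""
    # Each chunk stands in for the slice of data one machine would handle.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = [
        "big data is a big deal",
        "hadoop handles big data",
        "data is split across machines",
    ]
    print(word_count(sample).most_common(3))
```

In a real cluster the chunks would live on different machines and the
framework would take care of moving the partial results to the reduce step.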
"Hadoop" is the name of the toy elephant,
belonging to the son of Doug Cutting, the creator of Hadoop. So Doug
decided to name his software project after his son's toy,
"Hadoop".
Who invented Hadoop?
"All roads lead to Rome" - Google invented the
basic frameworks that constitute what is today popularly called Hadoop. They
faced the future first with the problem of handling billions of searches and
indexing millions of web pages. When they could not find any large-scale,
distributed computing platform for their needs, they just went ahead
and created their own.
Doug Cutting was inspired by Google's white papers and
decided to create an open source project called "Hadoop".
Yahoo further contributed to this project and played a key
role in developing Hadoop for enterprise applications. Since then, many
companies such as Facebook, LinkedIn, eBay, Hortonworks and Cloudera have
contributed to the Hadoop project.
So why should I care?
"For every 100 Big Data jobs there are only 2 qualified
candidates" - Fast Company
"By 2018, the United States will create 290,000 to
340,000 new big data jobs and more than half could go unfilled because skilled
candidates are in short supply." - McKinsey
Convinced? Not yet? Here's more.
"In a survey of 3,000 global companies, more than 83%
of respondents identified business analytics from big data as a top
priority" - IBM
"With Big Data skills, there is an opportunity for
students to gain a market advantage while starting their career" - Terri
Griffith, Professor, Santa Clara University
How does Facebook use Hadoop?
Facebook claims to have the largest Hadoop cluster: 4,000
machines that can handle 100 petabytes of data. 1 petabyte is equal to 250
billion pages of text.
Every day, people share 2.5 billion items on Facebook, including status updates, wall posts, photos,
videos and comments.
Using the underlying framework of Hadoop, Facebook can
process trillions of connections between people, places and things in minutes.
What is Apache Hive?
It is an open source data warehouse framework developed by
Facebook. It is primarily used for data analysis. Using Hive, you can query the
data in Hadoop using a SQL-like language called HiveQL.
How does Facebook use Hive?
Facebook uses Hive to analyze and summarize data in the
Facebook Ad Network. Using Hive, the advertisers on Facebook are able to view
aggregated statistics about their ads, such as number of clicks, number of
impressions, number of unique users and click-through rates.
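As a rough illustration of the kind of aggregation described above, here is a
small Python sketch. It is only an assumption-laden stand-in: SQLite plays the
role of Hive, the ad_events table and its columns are hypothetical, and the
sample rows are invented. The query is plain SQL that should look close to
what a HiveQL query for clicks, impressions and unique users per ad would
look like.

```python
import sqlite3

# Hypothetical ad-event log: (ad_id, user_id, event_type)
events = [
    ("ad1", "u1", "impression"),
    ("ad1", "u1", "click"),
    ("ad1", "u2", "impression"),
    ("ad2", "u1", "impression"),
    ("ad2", "u3", "impression"),
    ("ad2", "u3", "click"),
    ("ad2", "u4", "impression"),
]

# One row per ad with click, impression and unique-user counts.
QUERY = """
SELECT ad_id,
       SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks,
       SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions,
       COUNT(DISTINCT user_id) AS unique_users
FROM ad_events
GROUP BY ad_id
ORDER BY ad_id
"""


def ad_stats(rows):
    """Load the rows into an in-memory table and aggregate them per ad."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ad_events (ad_id TEXT, user_id TEXT, event_type TEXT)"
    )
    conn.executemany("INSERT INTO ad_events VALUES (?, ?, ?)", rows)
    stats = []
    for ad_id, clicks, impressions, users in conn.execute(QUERY):
        # Click-through rate = clicks / impressions (0 if never shown).
        ctr = clicks / impressions if impressions else 0.0
        stats.append((ad_id, clicks, impressions, users, ctr))
    conn.close()
    return stats


if __name__ == "__main__":
    for row in ad_stats(events):
        print(row)
```

The point of Hive is that an analyst writes only the query part; Hive turns it
into jobs that run over the data stored in Hadoop.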
Hive was also used for Facebook's Lexicon product that
reported trends by tracking buzz words on Facebook walls.
Using Hive, jobs that took days to complete could now be
completed in hours.
What is Apache Pig?
It is a platform for analyzing large data sets. Its high-level
procedural language, Pig Latin, is used to simplify querying large data sets.
How does LinkedIn use Pig?
Pig is a large component of LinkedIn's data cycle. It is
used behind features like "People you may know" and "Who viewed my
profile".
Another popular use of Pig at many companies is log analysis.
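To give a feel for what a log analysis job looks like, here is a small Python
sketch of the step-by-step, procedural style that Pig scripts follow: load the
data, filter it, group it, then count. This is not Pig code. The log format,
the sample lines and the function names are all hypothetical, chosen only for
this illustration.

```python
from collections import defaultdict

# Hypothetical log lines in the form "<timestamp> <status> <path>"
LOG_LINES = [
    "2013-05-01T10:00:00 200 /home",
    "2013-05-01T10:00:01 500 /profile",
    "2013-05-01T10:00:02 404 /missing",
    "2013-05-01T10:00:03 500 /profile",
    "2013-05-01T10:00:04 200 /home",
]


def load(lines):
    """Load step: parse each raw line into a record with named fields."""
    for line in lines:
        timestamp, status, path = line.split()
        yield {"timestamp": timestamp, "status": int(status), "path": path}


def filter_errors(records):
    """Filter step: keep only requests that failed (status 400 and above)."""
    return (r for r in records if r["status"] >= 400)


def group_by_path(records):
    """Group step: collect the records for each requested path."""
    groups = defaultdict(list)
    for record in records:
        groups[record["path"]].append(record)
    return groups


def count_errors(lines):
    """Run the whole pipeline and return the error count per path."""
    groups = group_by_path(filter_errors(load(lines)))
    return {path: len(records) for path, records in groups.items()}


if __name__ == "__main__":
    print(count_errors(LOG_LINES))
```

Each function here corresponds to one step of the pipeline, which is roughly
how a Pig script reads: one named transformation per line.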
What is Apache HBase?
HBase is a large-scale, distributed NoSQL database built on
top of the Hadoop Distributed File System (HDFS). It is an Apache open source
project.
How does Facebook use HBase?
The messaging application on Facebook, which includes
messages, chat and email, uses HBase. It handles 350 million users sending 15
billion person-to-person messages per month and supports 300 million users
sending 120 billion chat messages per month.
Facebook also uses HBase to store harvested data used in
Graph Search. With 1 billion new posts added every day, the index contains more
than 1 trillion total posts comprising hundreds of terabytes of data.
How does Twitter use Hadoop, HBase and Pig?
Twitter uses them to power features such as “who to follow” recommendations,
tailored follow suggestions for new users and “best of Twitter” emails.
Was this helpful to you? Do let me know what other
information you want me to mail you.
Mail us: analyticsfolder.nagarjuna@gmail.com