What are the benefits of using Hadoop, HBase, or Hive?
From my understanding, HBase avoi
Suppose you work with an RDBMS and have to choose between full table scans and index access, but can use only one of them.
If you choose full table scans, use Hive. If you choose index access, use HBase.
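To make the analogy concrete, here is a rough sketch in plain Python. It does not use the Hive or HBase APIs; the sample rows and the names `total_by_country`, `by_key` and `get_row` are made up for illustration. The first function touches every row (the Hive-like pattern), the second fetches one row by its key (the HBase-like pattern).

```python
# Toy illustration only: plain Python stand-ins, not the Hive or HBase APIs.
rows = [
    {"user_id": "u1", "country": "DE", "amount": 30},
    {"user_id": "u2", "country": "US", "amount": 50},
    {"user_id": "u3", "country": "DE", "amount": 20},
]


def total_by_country(table):
    """Hive-like access: read every row and aggregate (full table scan)."""
    totals = {}
    for row in table:
        totals[row["country"]] = totals.get(row["country"], 0) + row["amount"]
    return totals


# HBase-like access: rows are addressed by a key, so one row is fetched directly.
by_key = {row["user_id"]: row for row in rows}


def get_row(index, key):
    """Return the single row stored under `key`, or None if it is absent."""
    return index.get(key)


print(total_by_country(rows))
print(get_row(by_key, "u2"))
```

The scan costs time proportional to the table size no matter what you ask, while the keyed lookup only helps if you already know the key.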
Hadoop:
Hadoop combines the Hadoop Distributed File System (HDFS) for storage with the MapReduce computational processing model.
HBase:
HBase is a key-value store, good for reading and writing in near real time.
Hive:
Hive is used for data extraction from HDFS using SQL-like syntax. Hive uses the HQL language.
Pig:
Pig is a data flow language for creating ETL pipelines. It is a scripting language.
Understanding in depth
Hadoop
Hadoop is an open source project of the Apache foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005. It was created to support distribution for Nutch, the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation.
Features of Hadoop
Hadoop is built for high throughput rather than low latency. It runs batch operations over massive quantities of data, so the response time is not immediate. It is not a replacement for an RDBMS.
Versions of Hadoop
Two major versions of Hadoop are discussed here:
Hadoop 1.0
It has two main parts:
1. Data Storage Framework
It is a general-purpose file system called Hadoop Distributed File System (HDFS).
HDFS is schema-less.
It simply stores data files, and these data files can be in just about any format.
The idea is to store files as close to their original form as possible.
This in turn gives the business units and the organization much-needed flexibility and agility, without having to worry up front about what can be implemented.
2. Data Processing Framework
This is a simple functional programming model initially popularized by Google as MapReduce.
It essentially uses two functions, MAP and REDUCE, to process data.
The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs).
The "Reducers" then act on this intermediate data to produce the output data.
The two functions work in isolation from one another, which enables the processing to be distributed in a highly parallel, fault-tolerant and scalable way.
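The mapper/reducer flow described above can be sketched as a single-process word count in Python. This is not Hadoop's Java API; `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical names, and the `shuffle` step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(_key, line):
    """Mapper: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values that share a key into one output pair."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_phase(offset, line))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = run_job(["big data big cluster", "data node"])
print(counts)
```

Because each mapper call only sees its own line and each reducer call only sees one key, the calls can be spread across machines without coordinating with each other.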
Limitations of Hadoop 1.0
The first limitation was the requirement of MapReduce programming expertise.
It supported only batch processing, which is suitable for tasks such as log analysis and large-scale data mining projects but pretty much unsuitable for other kinds of projects.
One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors were left with two options:
Either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or
Extract data from HDFS and process it outside of Hadoop.
Neither of the options was viable, as they led to process inefficiencies caused by data being moved in and out of the Hadoop cluster.
Hadoop 2.0
In Hadoop 2.0, HDFS continues to be the data storage framework.
However, a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added.
Any application capable of dividing itself into parallel tasks is supported by YARN.
YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability and efficiency of applications.
It works by replacing the Job Tracker with a cluster-wide Resource Manager and a per-application Application Master, running applications on resources governed by the new Node Manager.
The Application Master is able to run any application and not just MapReduce.
This means Hadoop supports not only batch processing but also interactive and near-real-time processing. MapReduce is no longer the only data processing option.
Advantages of Hadoop
It stores data in its native form. No structure is imposed while keying in or storing data. HDFS is schema-less. Structure is imposed on the raw data only later, when the data needs to be processed.
It is scalable. Hadoop can store and distribute very large datasets across hundreds of inexpensive servers that operate in parallel.
It is resilient to failure. Hadoop is fault tolerant. It replicates data diligently: whenever data is sent to a node, the same data is also replicated to other nodes in the cluster, so that in the event of a node failure there is always another copy of the data available for use.
It is flexible. One of the key advantages of Hadoop is that it can work with any kind of data: structured, unstructured or semi-structured. Also, processing large datasets is fast in Hadoop owing to the "move code to data" paradigm, which avoids shipping the data across the network.
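The first advantage in this list (structure imposed only when the data is processed) is often called schema-on-read. A minimal Python sketch of the idea follows, assuming a small CSV payload and a made-up `SCHEMA`; none of these names come from Hadoop itself.

```python
import csv
import io

# Raw file content kept exactly as it arrived; nothing is validated on write.
RAW_FILE = "2021-01-03,alice,12.50\n2021-01-04,bob,7.25\n"

# Hypothetical schema, chosen only at read time: (column name, converter).
SCHEMA = [("day", str), ("user", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Parse raw text and apply column names and types while reading."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: convert(value) for (name, convert), value in zip(schema, fields)}


records = list(read_with_schema(RAW_FILE, SCHEMA))
print(sum(record["amount"] for record in records))
```

A different job could read the same bytes with a different schema, which is the flexibility the answer refers to.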
Hadoop Ecosystem
Following are the components of the Hadoop ecosystem:
HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.
HBase: It is Hadoop's database. It is a column-oriented NoSQL store rather than an RDBMS, and it supports structured data storage for large tables.
Hive: It enables analysis of large datasets using a language very similar to standard ANSI SQL, which implies that anyone familiar with SQL should be able to access data on a Hadoop cluster.
Pig: It is an easy-to-understand data flow language. It helps with the analysis of large datasets, which is the typical workload with Hadoop. Pig scripts are automatically converted to MapReduce jobs by the Pig interpreter.
ZooKeeper: It is a coordination service for distributed applications.
Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
Mahout: It is a scalable machine learning and data mining library.
Chukwa: It is a data collection system for managing large distributed systems.
Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.
Ambari: It is a web-based tool for provisioning, managing and monitoring Hadoop clusters.
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
Hive is not
A relational database
A design for Online Transaction Processing (OLTP).
A language for real-time queries and row-level updates.
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides a SQL-type query language called HiveQL or HQL.
It is familiar, fast, scalable and extensible.
Hive Architecture
The following components are contained in Hive Architecture:
User Interface: Hive is a data warehouse infrastructure that enables interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line and Hive HDInsight (on Windows Server).
MetaStore: Hive uses a database server of your choice to store the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL and queries against the schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine: The Hive Execution Engine is the link between the HiveQL process engine and MapReduce. It processes the query and generates the same results a MapReduce job would. It uses the flavor of MapReduce.
HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage layer that holds the data.
Use of Hive, HBase and Pig, based on my real-world experience in different projects.
Hive is used mostly for:
Analytics, where you need to analyze historical data
Generating business reports based on certain columns
Efficiently managing the data together with its metadata
Joining tables on frequently used columns by using the bucketing concept
Efficient storing and querying using the partitioning concept
It is not useful for transaction/row-level operations like update, delete, etc.
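The partitioning and bucketing points above can be illustrated with a small Python sketch. It is only a model of the idea: the column names, `NUM_BUCKETS` and the CRC32 hash are assumptions for the example, and Hive uses its own hash function and on-disk layout.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(key):
    """Stable hash of the bucketing column, reduced to a bucket number."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def load(rows):
    """Lay rows out as partition -> bucket -> rows, like directories and files."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row["day"]][bucket_for(row["user"])].append(row)
    return table


def query(table, day, user):
    """Touch only one partition and one bucket instead of the whole table."""
    candidates = table.get(day, {}).get(bucket_for(user), [])
    return [row for row in candidates if row["user"] == user]


rows = [
    {"day": "2021-01-03", "user": "alice", "amount": 5},
    {"day": "2021-01-03", "user": "bob", "amount": 7},
    {"day": "2021-01-04", "user": "alice", "amount": 9},
]
table = load(rows)
print(query(table, "2021-01-03", "alice"))
```

Partitioning prunes whole slices of data by the partition column, and bucketing lets two tables bucketed on the same join column be matched bucket by bucket.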
Pig is mostly used for:
Frequent data analysis on huge data
Generating aggregated values/counts on huge data
Generating enterprise level key performance indicators very frequently
HBase is mostly used:
For real-time processing of data
For efficiently managing complex and nested schemas
For real-time querying and faster results
For easy scalability with columns
For transaction/row-level operations like update, delete, etc.
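To show what row-level operations and flexible columns look like, here is a toy in-memory table in Python. `TinyTable` and its methods are hypothetical and are not the HBase client API. The sketch only mirrors the general shape of the data model: rows sorted by key, cells addressed as "family:qualifier".

```python
import bisect


class TinyTable:
    """Toy sorted key-value table; not the HBase client API."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Insert or update one cell of one row."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Fetch one row by key without scanning the others."""
        return dict(self._rows.get(row_key, {}))

    def delete(self, row_key):
        """Remove one row if present."""
        if row_key in self._rows:
            del self._rows[row_key]
            self._keys.pop(bisect.bisect_left(self._keys, row_key))

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


table = TinyTable()
table.put("user#001", "info:name", "Ann")
table.put("user#002", "info:name", "Bob")
table.put("user#001", "info:city", "Oslo")  # rows need not share columns
table.put("user#002", "info:name", "Rob")  # row-level update
print(table.get("user#001"))
```

New columns appear simply by writing to them, which is what "easy scalability with columns" refers to, and single rows can be updated or deleted without rewriting a file.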
Pig: it is better for handling files and cleaning data, for example removing null values, string handling and dropping unnecessary values.
Hive: for querying the cleaned data.
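The kind of cleaning mentioned here (dropping nulls, string handling) can be sketched as a step-by-step data flow. The example below is plain Python rather than Pig Latin, and the sample input and the names `load`, `drop_nulls` and `normalize` are made up. It only mirrors the procedural, one-transformation-per-step style that Pig scripts tend to have.

```python
def load(lines):
    """Split each comma-separated line into raw fields."""
    for line in lines:
        yield line.rstrip("\n").split(",")


def drop_nulls(records):
    """Discard records with a missing, empty or literal NULL field."""
    for fields in records:
        if all(f.strip() and f.strip().lower() != "null" for f in fields):
            yield fields


def normalize(records):
    """String handling: trim whitespace, lower-case the name, parse the amount."""
    for name, amount in records:
        yield name.strip().lower(), float(amount)


raw = ["Alice , 10", "NULL,5", "Bob,", " carol,2.5"]
cleaned = list(normalize(drop_nulls(load(raw))))
print(cleaned)
```

Once the data is in this cleaned shape, a declarative query layer such as Hive is the more natural tool for asking questions of it.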
I believe this thread hasn't done justice to HBase and Pig in particular. While I believe Hadoop is the choice of distributed, resilient file system for big-data lake implementations, the choice between HBase and Hive is well segregated.
That is, a lot of use cases specifically require either a SQL-like or a NoSQL-like interface. With Phoenix on top of HBase, SQL-like capabilities are certainly achievable; however, the performance, third-party integrations and dashboard updates are rather painful experiences. Still, it is an excellent choice for databases requiring horizontal scaling.
Pig is particularly good for non-recursive batch computations or ETL pipelining (where, in my experience, it outperforms Spark by a comfortable distance). Also, its high-level dataflow implementation is an excellent choice for batch querying and scripting. The choice between Pig and Hive also hinges on the need for client-side or server-side scripting, the required file formats, etc. Pig supports the Avro file format, which early versions of Hive did not, although later Hive releases added Avro support. The choice between a 'procedural dataflow language' and a 'declarative data flow language' is also a strong argument in choosing between Pig and Hive.