When to use Hadoop, HBase, Hive and Pig?

时光说笑 2020-12-04 04:21

What are the benefits of using Hadoop, HBase or Hive?

From my understanding, HBase avoi

16 Answers
  • 2020-12-04 05:05

    First of all, let's be clear that Hadoop was created to process very large amounts of data quickly, work that previously took a long time in an RDBMS.

    Two terms are worth knowing first:

    1. Structured data: This is the data used in a traditional RDBMS; it is divided into well-defined structures.

    2. Unstructured data: This is important to understand, since about 80% of the world's data is unstructured or semi-structured. This is data in its raw form, which cannot be processed using an RDBMS. Example: Facebook or Twitter data. (http://www.dummies.com/how-to/content/unstructured-data-in-a-big-data-environment.html).

    So, a large amount of data was being generated in the last few years, most of it unstructured, and that gave birth to Hadoop. It was mainly used for very large amounts of data that would take an infeasible amount of time to process in an RDBMS. It had drawbacks, such as not being usable for comparatively small data in real time, but newer versions have worked on removing them.

    Before going further, note that a new Big Data tool is usually created when a shortcoming is found in the previous tools. So each tool you come across was built to overcome a problem of its predecessors.

    Hadoop can be simply described as two things: MapReduce and HDFS. MapReduce is where the processing takes place, and HDFS is the distributed file system where the data is stored. This structure follows the WORM principle, i.e. write once, read many times. So, once data is stored in HDFS, it cannot be modified in place. This led to the creation of HBase, a NoSQL product where data can still be changed after it has been written.
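
    To make the MapReduce half more concrete, here is a minimal in-process sketch of the map, shuffle and reduce steps as a word count. It does not use Hadoop at all: the function names and the `sample_lines` input are made up for illustration, and a real job would read its input from HDFS and run the phases on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in-process over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input standing in for the lines of a file stored in HDFS.
sample_lines = ["big data big cluster", "data node"]
counts = word_count(sample_lines)
print(counts)
```

    The input is only ever read, never modified, which is the write-once, read-many pattern described above.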

    But over time Hadoop showed many shortcomings, and to address them different environments were built on top of the Hadoop structure. Pig and Hive are two popular examples.

    Hive was created for people with an SQL background. Its queries are written in HiveQL, which is similar to SQL. Hive was developed to process fully structured data; it is not meant for unstructured data.

    Pig, on the other hand, has its own query language, Pig Latin. It can be used for both structured and unstructured data.
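
    The difference in style between the two can be sketched in plain Python. This is only an analogy under stated assumptions: `sqlite3` stands in for a SQL-like engine (it is not Hive), the step-by-step half imitates the shape of a Pig Latin script (it is not Pig), and the `visits` records are invented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-visit records: (page, HTTP status).
visits = [("/home", 200), ("/home", 200), ("/about", 404), ("/about", 200)]

# Declarative style (what HiveQL resembles): state the result you want.
# sqlite3 is only a local stand-in here; it is not Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, status INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", visits)
sql_result = conn.execute(
    "SELECT page, COUNT(*) FROM visits "
    "WHERE status = 200 GROUP BY page ORDER BY page"
).fetchall()
conn.close()

# Data-flow style (what a Pig Latin script resembles): name each step.
ok_visits = [v for v in visits if v[1] == 200]   # comparable to FILTER
grouped = defaultdict(list)
for page, status in ok_visits:                   # comparable to GROUP ... BY
    grouped[page].append(status)
flow_result = sorted((page, len(rows)) for page, rows in grouped.items())

print(sql_result)
print(flow_result)
```

    Both halves compute the same per-page counts; the first describes the result in one statement, the second spells out the pipeline stage by stage.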

    As for when to use Hive and when to use Pig, I don't think anyone other than the architect of Pig could say. Follow the link: https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html

  • 2020-12-04 05:09

    The short answer to this question is:

    Hadoop - A framework that provides a distributed file system and a programming model, which let us store humongous amounts of data and process it in a distributed fashion very efficiently, with much less processing time compared to traditional approaches.

    (HDFS - Hadoop Distributed File system) (Map Reduce - Programming Model for distributed processing)

    Hive - A query layer that allows reading and writing data in the Hadoop Distributed File System in a very popular SQL-like fashion. This made life easier for many people from a non-programming background, as they no longer have to write MapReduce programs, except for very complex scenarios that Hive does not support.

    HBase - A column-oriented NoSQL database. The underlying storage layer for HBase is again HDFS. Its most important use case is storing billions of rows with millions of columns. HBase's low latency enables fast, random access to individual records over distributed data, an important feature that makes it useful for complex projects like recommender engines. Also, its record-level versioning capability allows users to store transactional data very efficiently (this solves the problem of updating records that we have with HDFS and Hive).
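
    The random access by row key and the record-level versioning mentioned above can be pictured with a small in-memory toy. This is not the HBase client API: the class `ToyVersionedTable`, its methods and the sample row are all hypothetical, and the sketch only illustrates the idea of keeping several timestamped versions per cell.

```python
from collections import defaultdict


class ToyVersionedTable:
    """In-memory toy: row key -> column -> list of (timestamp, value)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Write a new version instead of overwriting the old cell value."""
        versions = self.rows[row_key][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)  # newest first
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row_key, column, versions=1):
        """Random access by row key; returns the newest values first."""
        cells = self.rows.get(row_key, {}).get(column, [])
        return [value for _, value in cells[:versions]]


# Hypothetical row: an "update" is just another put with a newer timestamp.
table = ToyVersionedTable(max_versions=2)
table.put("user42", "profile:city", "Pune", timestamp=1)
table.put("user42", "profile:city", "Delhi", timestamp=2)
table.put("user42", "profile:city", "Mumbai", timestamp=3)

latest = table.get("user42", "profile:city")
history = table.get("user42", "profile:city", versions=2)
missing = table.get("nobody", "profile:city")
print(latest, history, missing)
```

    Here the oldest value is discarded once the version limit is exceeded, while a lookup by row key returns the newest value without scanning other rows.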

    Hope this helps to quickly understand the three tools above.

  • 2020-12-04 05:11

    Let me try to answer in a few words.

    Hadoop is an ecosystem that comprises all the other tools. So you can't compare Hadoop with them, but you can compare MapReduce.

    Here are my few cents:

    1. Hive: If your need is very SQL-ish, meaning your problem statement can be addressed with SQL, then the easiest thing to do is to use Hive. The other case where you would use Hive is when you want a server to have a certain structure of data.
    2. Pig: If you are comfortable with Pig Latin, your need is more about data pipelines, and your data lacks structure, you could use Pig. Honestly, there is not much difference between Hive and Pig with respect to use cases.
    3. MapReduce: If your problem cannot be solved with straight SQL, first try to write a UDF for Hive or Pig; if the UDF does not solve the problem, then doing it in MapReduce makes sense.
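
    Read as a rule of thumb, the three points above could be written down roughly like this. The function `choose_tool` and its three yes/no flags are hypothetical, and the order of the checks is just one reading of the list, not a definitive rule.

```python
def choose_tool(fits_sql, is_pipeline, udf_is_enough):
    """Return a tool name following the three-point heuristic above.

    All three flags are hypothetical yes/no answers about the workload.
    """
    if fits_sql:
        return "Hive"
    if is_pipeline:
        return "Pig"
    if udf_is_enough:
        return "Hive or Pig with a UDF"
    return "MapReduce"


print(choose_tool(fits_sql=True, is_pipeline=False, udf_is_enough=False))
print(choose_tool(fits_sql=False, is_pipeline=False, udf_is_enough=False))
```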
  • 2020-12-04 05:12

    Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

    There are four main modules in Hadoop.

    1. Hadoop Common: The common utilities that support the other Hadoop modules.

    2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

    3. Hadoop YARN: A framework for job scheduling and cluster resource management.

    4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

    Before going further, let's note that there are three different types of data.

    • Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server etc.

    • Unstructured: The data does not have any structure and can be in any form: web server logs, e-mail, images etc.

    • Semi-structured: The data is not strictly structured but has some structure, e.g. XML files.

    Depending on the type of data to be processed, we have to choose the right technology.
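
    A rough way to see the difference between the three types is how much work it takes to get a field out of each. The sketch below uses only the Python standard library; the sample record, XML snippet and log line are invented for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row follows the same schema.
structured = io.StringIO("id,name\n1,alice\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing tags, but fields may vary per record.
semi = ET.fromstring("<user id='1'><name>alice</name></user>")
semi_name = semi.findtext("name")

# Unstructured: free text; structure has to be extracted, e.g. by regex.
log_line = '127.0.0.1 - - "GET /home HTTP/1.1" 200'
match = re.search(r'"(\w+) (\S+) [^"]*" (\d{3})', log_line)
method, path, status = match.groups()

print(rows, semi_name, method, path, status)
```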

    Some more projects, which are part of Hadoop:

    • HBase™: A scalable, distributed database that supports structured data storage for large tables.

    • Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.

    • Pig™: A high-level data-flow language and execution framework for parallel computation.

    A Hive vs. Pig comparison can be found in this article and in my other post at this SE question.

    HBase won't replace MapReduce. HBase is a scalable distributed database and MapReduce is a programming model for distributed processing of data. MapReduce may act on data stored in HBase during processing.

    You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.

    You can use Sqoop to import structured data from a traditional RDBMS (Oracle, SQL Server etc.) and process it with Hadoop MapReduce.

    You can use Flume to collect unstructured data and process it with Hadoop MapReduce.

    Have a look at: Hadoop Use Cases.

    Hive should be used for analytical querying of data collected over a period of time, e.g. calculating trends or summarizing website logs, but it can't be used for real-time queries.

    HBase fits real-time querying of Big Data. Facebook uses it for messaging and real-time analytics.

    Pig can be used to construct data flows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it and store it in relational database systems. It is good for ad-hoc analysis.

    Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.
