When to use Hadoop, HBase, Hive and Pig?


What are the benefits of using either Hadoop or HBase or Hive?

From my understanding, HBase avoids using MapReduce and has column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase. I would also like to know how Hive compares with Pig.

16 answers
  • 2020-12-04 04:54

    1. We use Hadoop to store large data (structured, unstructured, and semi-structured) in plain file formats such as txt and csv (see the HDFS sketch after this list).

    2. If we want column-level updates on our data, then we use HBase.

    3. In the case of Hive, we store big data that is in a structured format, and in addition we provide analysis on that data.

    4. Pig is a tool that uses the Pig Latin language to analyze data in any format (structured, semi-structured, or unstructured).
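
    To make item 1 concrete, here is a minimal sketch of putting a file into HDFS with the Hadoop FileSystem Java API. The namenode address and the paths are assumptions, not from the original answer:

        // Copy a local csv file into HDFS; Hadoop replicates its blocks.
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsUpload {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed host:port
                try (FileSystem fs = FileSystem.get(conf)) {
                    fs.copyFromLocalFile(new Path("/tmp/sales.csv"),       // local source
                                         new Path("/data/raw/sales.csv")); // HDFS target
                }
            }
        }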

  • 2020-12-04 05:00

    I worked on a Lambda architecture processing real-time and batch loads. Real-time processing is needed where fast decisions must be made, for example on a fire alarm sent by a sensor or fraud detection in banking transactions. Batch processing is needed to summarize data that can be fed into BI systems.

    We used Hadoop ecosystem technologies for the applications above.

    Real-time processing

    Apache Storm: stream data processing, rule application

    HBase: datastore serving the real-time dashboard

    Batch processing

    Hadoop: crunching huge chunks of data, building a 360-degree overview or adding context to events. Interfaces and frameworks like Pig, MR, Spark, Hive, and Shark help with the computation. This layer needs a scheduler, for which Oozie is a good option.

    Event handling layer

    Apache Kafka was the first layer, consuming high-velocity events from the sensors (see the sketch below). Kafka serves both the real-time and the batch analytics data flows, through LinkedIn connectors.
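
    A minimal sketch of that first layer, publishing sensor events with the Kafka Java producer API. The broker address, topic name, and payload are assumptions:

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class SensorEventProducer {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "broker1:9092"); // assumed broker
                props.put("key.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    // Keying by sensor id keeps one sensor's events ordered within
                    // a partition; both the stream and the batch consumers read them.
                    producer.send(new ProducerRecord<>("sensor-events",
                            "sensor-42", "{\"temp\": 71.3, \"alarm\": false}"));
                }
            }
        }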

  • 2020-12-04 05:01

    MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using the other HBase APIs, such as the Java client API, to put or fetch the data. But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so that often doesn't make much sense: normal sequential programs would be highly inefficient when your data is that huge.

    Coming back to the first part of your question: Hadoop is basically two things, a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like all other file systems, HDFS provides us with storage, but in a fault-tolerant manner, with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable big data store, modelled after Google's BigTable, and it stores data as key/value pairs (see the sketch below).
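
    A minimal sketch of that random access, using the HBase Java client API. The table, column family, and qualifier names are assumptions:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.*;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseRandomAccess {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                try (Connection conn = ConnectionFactory.createConnection(conf);
                     Table table = conn.getTable(TableName.valueOf("users"))) {
                    // Random write: row key -> column value
                    Put put = new Put(Bytes.toBytes("user#1001"));
                    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                                  Bytes.toBytes("alice"));
                    table.put(put);

                    // Random read by row key -- exactly what raw HDFS cannot do
                    Result result = table.get(new Get(Bytes.toBytes("user#1001")));
                    System.out.println(Bytes.toString(
                            result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
                }
            }
        }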

    Coming to Hive: it provides data warehousing facilities on top of an existing Hadoop cluster, along with an SQL-like interface that makes your work easier in case you are coming from an SQL background. You can create tables in Hive and store data there. You can even map your existing HBase tables to Hive and operate on them (a query sketch follows).
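
    A minimal sketch of that SQL-like interface, querying Hive over JDBC (HiveServer2). The connection URL, table, and columns are assumptions:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveQuery {
            public static void main(String[] args) throws Exception {
                Class.forName("org.apache.hive.jdbc.HiveDriver");
                try (Connection con = DriverManager.getConnection(
                             "jdbc:hive2://hiveserver:10000/default", "hive", "");
                     Statement stmt = con.createStatement();
                     // Hive compiles this familiar SQL into MapReduce jobs.
                     ResultSet rs = stmt.executeQuery(
                             "SELECT country, COUNT(*) FROM page_views GROUP BY country")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                    }
                }
            }
        }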

    Pig, meanwhile, is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them with the Pig interpreter (sketched below). Pig makes life a lot easier; writing MapReduce directly is not always easy, and in some cases it can really become a pain.
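
    A minimal sketch of that flow, driving Pig Latin from Java through PigServer. The input path, schema, and output path are assumptions:

        import org.apache.pig.ExecType;
        import org.apache.pig.PigServer;

        public class PigFlow {
            public static void main(String[] args) throws Exception {
                PigServer pig = new PigServer(ExecType.MAPREDUCE);
                // Each registered line is one Pig Latin statement; the interpreter
                // builds a dataflow and compiles it to MapReduce on store().
                pig.registerQuery("logs = LOAD '/data/raw/access_log' USING PigStorage('\\t') "
                        + "AS (ip:chararray, url:chararray, bytes:long);");
                pig.registerQuery("by_url = GROUP logs BY url;");
                pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;");
                pig.store("hits", "/data/out/url_hits"); // triggers execution
            }
        }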

    I wrote an article with a short comparison of the different tools in the Hadoop ecosystem some time ago. It's not an in-depth comparison, but a short intro to each of these tools that can help you get started. (Just to add to my answer; no self-promotion intended.)

    Both Hive and Pig queries get converted into MapReduce jobs under the hood.

    HTH

  • 2020-12-04 05:02

    Cleansing data in Pig is very easy. A suitable approach is to cleanse the data with Pig, then process it with Hive, and later upload it to HDFS; a sketch of the cleansing step follows.
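
    A minimal sketch of such a cleansing pass, again running Pig Latin from Java. The paths and the schema are assumptions:

        import org.apache.pig.ExecType;
        import org.apache.pig.PigServer;

        public class PigCleanse {
            public static void main(String[] args) throws Exception {
                PigServer pig = new PigServer(ExecType.MAPREDUCE);
                pig.registerQuery("raw = LOAD '/data/raw/events' USING PigStorage(',') "
                        + "AS (id:chararray, ts:long, amount:double);");
                // Drop malformed rows before any downstream Hive processing.
                pig.registerQuery("clean = FILTER raw BY id IS NOT NULL AND amount >= 0;");
                pig.store("clean", "/data/clean/events"); // cleansed set for Hive to pick up
            }
        }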

  • 2020-12-04 05:04

    I implemented a Hive data platform recently at my firm, and I can speak to it first-hand since I was a one-man team.

    Objective

    1. To have the web log files collected daily from 350+ servers queryable through a SQL-like language
    2. To replace daily aggregation data generated through MySQL with Hive
    3. To build custom reports through queries in Hive

    Architecture Options

    I benchmarked the following options:

    1. Hive + HDFS
    2. Hive + HBase (queries were too slow, so I dropped this option)

    Design

    1. Daily log files were transported to HDFS
    2. MR jobs parsed these log files and wrote output files to HDFS
    3. Hive tables were created with partitions and locations pointing to those HDFS locations (see the sketch after this list)
    4. Hive query scripts were created (call it HQL if you like, as distinct from SQL) that in turn ran MR jobs in the background and generated the aggregation data
    5. All these steps were put into an Oozie workflow, scheduled with a daily Oozie coordinator
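
    A minimal sketch of step 3, defining a partitioned external Hive table over the MR output via the HiveServer2 JDBC driver. The URL, table name, columns, and HDFS locations are assumptions:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class CreateLogTable {
            public static void main(String[] args) throws Exception {
                Class.forName("org.apache.hive.jdbc.HiveDriver");
                try (Connection con = DriverManager.getConnection(
                             "jdbc:hive2://hiveserver:10000/default", "hive", "");
                     Statement stmt = con.createStatement()) {
                    // External table: Hive reads the files the MR jobs wrote.
                    stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS web_logs ("
                            + " ip STRING, url STRING, status INT)"
                            + " PARTITIONED BY (dt STRING)"
                            + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
                            + " LOCATION '/data/parsed/web_logs'");
                    // Register one day's MR output as a partition.
                    stmt.execute("ALTER TABLE web_logs ADD IF NOT EXISTS"
                            + " PARTITION (dt='2020-12-01')"
                            + " LOCATION '/data/parsed/web_logs/dt=2020-12-01'");
                }
            }
        }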

    Summary

    HBase is like a Map: if you know the key, you can instantly get the value. But if you want to know how many integer keys in HBase lie between 1000000 and 2000000, HBase alone is not suitable for that.

    If you have data that needs to be aggregated, rolled up, or analyzed across rows, then consider Hive. The sketch below illustrates the contrast.
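
    A minimal sketch of the contrast: counting keys in a range means a row-by-row scan with the HBase client, while Hive expresses it as a single aggregate. The table and key layout are assumptions:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.*;
        import org.apache.hadoop.hbase.util.Bytes;

        public class RangeCount {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                try (Connection conn = ConnectionFactory.createConnection(conf);
                     Table table = conn.getTable(TableName.valueOf("events"))) {
                    Scan scan = new Scan(Bytes.toBytes("1000000"),  // start row (inclusive)
                                         Bytes.toBytes("2000000")); // stop row (exclusive)
                    long count = 0;
                    try (ResultScanner scanner = table.getScanner(scan)) {
                        for (Result r : scanner) {
                            count++; // every row travels to the client just to be counted
                        }
                    }
                    System.out.println(count);
                    // In Hive the same question is one query:
                    //   SELECT COUNT(*) FROM events WHERE key BETWEEN 1000000 AND 2000000;
                }
            }
        }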

    Hopefully this helps.

    Hive actually rocks... I know, I have been living with it for 12 months now... So does HBase...

  • 2020-12-04 05:04

    For a comparison between Hadoop and Cassandra/HBase, read this post.

    Basically, HBase enables really fast reads and writes with scalability. How fast and scalable? Facebook uses it to manage its user statuses, photos, chat messages, etc. HBase is so fast that Facebook has even built stacks that use HBase as the data store for Hive itself.

    Hive, on the other hand, is more like a data warehousing solution. You can use SQL-like syntax to query Hive contents, which results in a MapReduce job; it is not ideal for fast, transactional systems.
