How does Hive store data and what is SerDe?

执笔经年 2021-02-04 12:50

When querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When performing an INSERT or CTAS (create table as select), the table's SerDe will serialize Hive's internal representation of a row of data into the bytes that are written to the output file.

4 answers
  • 2021-02-04 13:31

    Hive can analyse semi-structured and unstructured data as well, by using (1) complex data types (struct, array, union) and (2) a SerDe.

    The SerDe interface allows us to instruct Hive as to how a record should be processed. The Serializer takes a Java object that Hive has been working on and converts it into something that Hive can store; the Deserializer takes the binary representation of a record and translates it into a Java object that Hive can manipulate.
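
    As a rough illustration, here is a minimal sketch of a custom SerDe: a hypothetical WholeLineSerDe that exposes each line of a file as one row with a single string column. It uses the AbstractSerDe method signatures of Hive 1.x/2.x (later releases changed initialize slightly); a real SerDe would add proper field handling and error checking.

        import java.util.Arrays;
        import java.util.Properties;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hive.serde2.AbstractSerDe;
        import org.apache.hadoop.hive.serde2.SerDeException;
        import org.apache.hadoop.hive.serde2.SerDeStats;
        import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
        import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
        import org.apache.hadoop.hive.serde2.objectinspector.StructField;
        import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
        import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;

        // Hypothetical SerDe: each file line becomes one row with a single
        // string column named "line".
        public class WholeLineSerDe extends AbstractSerDe {

          private ObjectInspector inspector;

          @Override
          public void initialize(Configuration conf, Properties tbl) throws SerDeException {
            // Describe the row shape: a struct with one string field.
            inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("line"),
                Arrays.<ObjectInspector>asList(
                    PrimitiveObjectInspectorFactory.javaStringObjectInspector));
          }

          @Override
          public Object deserialize(Writable blob) throws SerDeException {
            // Bytes from the file -> Java objects Hive operates on.
            return Arrays.asList(blob.toString());
          }

          @Override
          public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
            // Hive's internal row -> bytes written to the output file.
            StructObjectInspector soi = (StructObjectInspector) oi;
            StructField field = soi.getAllStructFieldRefs().get(0);
            return new Text(String.valueOf(soi.getStructFieldData(obj, field)));
          }

          @Override
          public ObjectInspector getObjectInspector() throws SerDeException {
            return inspector;
          }

          @Override
          public Class<? extends Writable> getSerializedClass() {
            return Text.class;
          }

          @Override
          public SerDeStats getSerDeStats() {
            return null; // stats not tracked in this sketch
          }
        }

    Hive calls deserialize for every record it reads and serialize for every record it writes, so these two methods are exactly the translation layer described above.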

  • 2021-02-04 13:33

    In this respect we can see Hive as a kind of database engine. This engine works on tables which are built from records.
    When we let Hive (as well as any other database) work in its own internal formats, we do not care.
    When we want Hive to process our own files as tables (external tables), we have to let it know how to translate the data in the files into records. This is exactly the role of the SerDe. You can see it as a plug-in which enables Hive to read / write your data.
    For example, you want to work with CSV. Here is an example of a CSV SerDe: https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java
    Its deserialize method reads a record and chops it into fields, assuming it is CSV; its serialize method takes a row and formats it as a CSV line.
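
    To make the two directions concrete, here is a toy sketch in plain Java (hypothetical class name, and it ignores the quoting/escaping that a real CSV SerDe such as the one linked above has to handle):

        import java.util.Arrays;
        import java.util.List;
        import java.util.stream.Collectors;

        public class CsvRowCodec {

          // deserialize: one line of bytes from the file -> a list of column values
          static List<String> deserialize(String line) {
            return Arrays.asList(line.split(",", -1));
          }

          // serialize: Hive's row object -> the line of text that gets stored
          static String serialize(List<String> row) {
            return row.stream().collect(Collectors.joining(","));
          }

          public static void main(String[] args) {
            List<String> row = deserialize("1,alice,42"); // [1, alice, 42]
            System.out.println(serialize(row));           // 1,alice,42
          }
        }

    In Hive you attach such a plug-in to a table with ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde' in the CREATE TABLE statement.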

  • 2021-02-04 13:45

    Answers

    1. Yes, SerDe is a library, built into the Hive API (the org.apache.hadoop.hive.serde2 package).
    2. Hive uses file systems like HDFS (or other storage such as FTP) to store data; the data is in the form of tables, which have rows and columns.
    3. SerDe (Serializer/Deserializer) instructs Hive on how to process a record (row). Hive can thereby also process semi-structured records (XML, email, etc.) or unstructured records (audio, video, etc.). For example, if you have 1000 GB worth of RSS feeds (RSS XMLs), you can ingest them into a location in HDFS. You would then need to write a custom SerDe, based on your XML structure, so that Hive knows how to load the XML files into Hive tables and the other way around (see the sketch after this list).
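
    As a sketch of what such a custom XML SerDe's deserialize would do at its core, here is a hypothetical parser that maps one RSS <item> record to two columns, title and link (plain Java, assuming well-formed input):

        import java.io.ByteArrayInputStream;
        import java.nio.charset.StandardCharsets;
        import java.util.Arrays;
        import java.util.List;

        import javax.xml.parsers.DocumentBuilder;
        import javax.xml.parsers.DocumentBuilderFactory;

        import org.w3c.dom.Document;

        public class RssItemParser {

          // One <item> element per file record -> the Hive row (title, link).
          static List<String> parseItem(String itemXml) throws Exception {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(itemXml.getBytes(StandardCharsets.UTF_8)));
            String title = doc.getElementsByTagName("title").item(0).getTextContent();
            String link  = doc.getElementsByTagName("link").item(0).getTextContent();
            return Arrays.asList(title, link);
          }

          public static void main(String[] args) throws Exception {
            String xml = "<item><title>Hello</title><link>http://example.com</link></item>";
            System.out.println(parseItem(xml)); // [Hello, http://example.com]
          }
        }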

    For more information on how to write a SerDe, read this post.

  • 2021-02-04 13:47

    Note the direction of the two terms, which are easy to get back to front. Serialising is done on write: the structured data is serialised into a bit/byte stream for storage. On read, the data is deserialised from the bit/byte storage format back into the structure required by the reader. E.g. Hive needs structures that look like rows and columns, but HDFS stores the data in bit/byte blocks; so: serialise on write, deserialise on read.
