Where is the Avro schema stored when I create a Hive table with the 'STORED AS AVRO' clause?

难免孤独 2021-02-04 09:26

There are at least two different ways of creating a Hive table backed by Avro data:

1) Creating a table based on an Avro schema (in this example stored in HDFS):
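For example (just a sketch: the schema path is a placeholder, and depending on the Hive version the column list may be optional when avro.schema.url is supplied):

    create external table mytable_from_schema_file
    stored as avro
    tblproperties ('avro.schema.url'='hdfs:///schemas/mytable.avsc');

2) Creating a table with an explicit column list and the 'STORED AS AVRO' clause, without any schema file, e.g. (again a sketch with made-up columns):

    create table mytable_inline (myint int, mystring string)
    stored as avro;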

2 Answers
  •  被撕碎了的回忆
    2021-02-04 10:08

    The following refers to the use case where no schema file is involved.

    The schema is stored in 2 places:
    1. The metastore
    2. As part of the data files

    All the information for the DESC/SHOW commands is taken from the metastore,
    and every DDL change impacts only the metastore (see the quick check below).

    When you query the data, the matching between the 2 schemas is done by the column names.
    If there is a mismatch in the column types, you'll get an error.
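    As a quick check (a sketch, using the mytable created in the demo below; the exact output layout depends on the Hive version), both of the following read only the metastore and never open a data file:

    describe mytable;            -- column names and types come from the metastore only
    show create table mytable;   -- likewise, nothing is read from the data files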

    Demo

    -- CTAS: both the metastore definition and the Avro schema embedded in the data files are derived from the SELECT list
    create table mytable 
    stored as avro 
    as 
    select  1               as myint
           ,'Hello'         as mystring
           ,current_date    as mydate
    ;
    

    select * from mytable
    ;
    

    +-------+----------+------------+
    | myint | mystring |   mydate   |
    +-------+----------+------------+
    |     1 | Hello    | 2017-05-30 |
    +-------+----------+------------+
    

    Metastore

    -- run against the metastore's backing RDBMS (e.g. MySQL), not from within Hive
    select      c.column_name
               ,c.integer_idx
               ,c.type_name
    
    from                metastore.DBS        as d
                join    metastore.TBLS       as t on t.db_id = d.db_id
                join    metastore.SDS        as s on s.sd_id = t.sd_id
                join    metastore.COLUMNS_V2 as c on c.cd_id = s.cd_id
    
    where       d.name     = 'local_db'
            and t.tbl_name = 'mytable'
    
    order by    integer_idx
    

    +-------------+-------------+-----------+
    | column_name | integer_idx | type_name |
    +-------------+-------------+-----------+
    | myint       |           0 | int       |
    | mystring    |           1 | string    |
    | mydate      |           2 | date      |
    +-------------+-------------+-----------+
    

    avro-tools

    bash-4.1$ # 000000_0 is a data file taken from the table's directory in the Hive warehouse
    bash-4.1$ avro-tools getschema 000000_0 
    
    {
      "type" : "record",
      "name" : "mytable",
      "namespace" : "local_db",
      "fields" : [ {
        "name" : "myint",
        "type" : [ "null", "int" ],
        "default" : null
      }, {
        "name" : "mystring",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "mydate",
        "type" : [ "null", {
          "type" : "int",
          "logicalType" : "date"
        } ],
        "default" : null
      } ]
    }
    

    -- rename myint to dummy1: a metastore-only change, the data files are untouched
    alter table mytable change myint dummy1 int;
    

    select * from mytable;
    

    +--------+----------+------------+
    | dummy1 | mystring |   mydate   |
    +--------+----------+------------+
    | (null) | Hello    | 2017-05-30 |
    +--------+----------+------------+
    

    -- add a new column named myint: again, only the metastore changes
    alter table mytable add columns (myint int);
    

    select * from mytable;
    

    +--------+----------+------------+-------+
    | dummy1 | mystring |   mydate   | myint |
    +--------+----------+------------+-------+
    | (null) | Hello    | 2017-05-30 |     1 |
    +--------+----------+------------+-------+
    

    Metastore

    +-------------+-------------+-----------+
    | column_name | integer_idx | type_name |
    +-------------+-------------+-----------+
    | dummy1      |           0 | int       |
    | mystring    |           1 | string    |
    | mydate      |           2 | date      |
    | myint       |           3 | int       |
    +-------------+-------------+-----------+
    

    avro-tools
    (same schema as the original one)

    bash-4.1$ avro-tools getschema 000000_0 
    
    {
      "type" : "record",
      "name" : "mytable",
      "namespace" : "local_db",
      "fields" : [ {
        "name" : "myint",
        "type" : [ "null", "int" ],
        "default" : null
      }, {
        "name" : "mystring",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "mydate",
        "type" : [ "null", {
          "type" : "int",
          "logicalType" : "date"
        } ],
        "default" : null
      } ]
    }
    

    Any work against the table is done based on the metadata stored in the Metastore.
    When the table is queried, additional metadata is used: the schema stored in the data files themselves.
    The structure of the query result is constructed from the Metastore (see in my example that 4 columns are returned after the table was altered).
    The data returned depends on both schemas: a field in the file schema is mapped to the column with the same name in the Metastore schema.
    If the names match but the data types don't, an error is raised (a sketch of that case follows below).
    A field in the data file that has no corresponding column in the Metastore is not presented.
    A column in the Metastore that has no corresponding field in the data file schema holds NULL values.
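    The type-mismatch case can be reproduced with another metastore-only change (a sketch using a fresh table so the renames above don't interfere; whether the ALTER is accepted depends on hive.metastore.disallow.incompatible.col.type.changes, and the exact error text depends on the Hive / Avro SerDe version):

    create table mytable2 stored as avro as select 'Hello' as mystring;

    -- the schema inside the data file says mystring is a string; declare it as int in the metastore
    alter table mytable2 change mystring mystring int;

    select * from mytable2;   -- expected to fail: same column name, incompatible types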
