There are at least two different ways of creating a Hive table backed with Avro data:
1) Creating the table based on an Avro schema (in this example a schema file stored in HDFS, referenced through the avro.schema.url table property)
2) Creating the table from explicit column definitions and letting Hive generate the Avro schema, e.g. with a CTAS statement
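For the first use-case, a minimal sketch (the table name and the HDFS path of the .avsc file are illustrative; avro.schema.url is the standard AvroSerDe table property):
create external table mytable_from_schema
stored as avro
tblproperties ('avro.schema.url'='hdfs:///user/me/schemas/mytable.avsc');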
The following refers to the second use-case, where no schema file is involved.
The schema is stored in 2 places:
1. The Metastore
2. As part of the data files
All the information for the DESC/SHOW commands is taken from the Metastore.
Every DDL change impacts only the Metastore.
When you query the data, the two schemas are matched by column name.
If the column types do not match, you'll get an error.
create table mytable
stored as avro
as
select 1 as myint
,'Hello' as mystring
,current_date as mydate
;
select * from mytable
;
+-------+----------+------------+
| myint | mystring | mydate |
+-------+----------+------------+
| 1 | Hello | 2017-05-30 |
+-------+----------+------------+
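Since DESC/SHOW read from the Metastore, a plain describe already shows the same three columns and types that the Metastore query below returns (output trimmed; the exact shape depends on the client):
describe mytable;
+-----------+-----------+---------+
| col_name  | data_type | comment |
+-----------+-----------+---------+
| myint     | int       |         |
| mystring  | string    |         |
| mydate    | date      |         |
+-----------+-----------+---------+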
Metastore
select c.column_name
,c.integer_idx
,c.type_name
from metastore.DBS as d
join metastore.TBLS as t on t.db_id = d.db_id
join metastore.SDS as s on s.sd_id = t.sd_id
join metastore.COLUMNS_V2 as c on c.cd_id = s.cd_id
where d.name = 'local_db'
and t.tbl_name = 'mytable'
order by integer_idx;
+-------------+-------------+-----------+
| column_name | integer_idx | type_name |
+-------------+-------------+-----------+
| myint | 0 | int |
| mystring | 1 | string |
| mydate | 2 | date |
+-------------+-------------+-----------+
avro-tools
(run against the table's data file, 000000_0, taken from the table's directory in HDFS)
bash-4.1$ avro-tools getschema 000000_0
{
"type" : "record",
"name" : "mytable",
"namespace" : "local_db",
"fields" : [ {
"name" : "myint",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "mystring",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "mydate",
"type" : [ "null", {
"type" : "int",
"logicalType" : "date"
} ],
"default" : null
} ]
}
Renaming a column is a DDL change, so it updates only the Metastore:
alter table mytable change myint dummy1 int;
select * from mytable;
+--------+----------+------------+
| dummy1 | mystring | mydate |
+--------+----------+------------+
| (null) | Hello | 2017-05-30 |
+--------+----------+------------+
Adding a column named myint re-creates the mapping to the myint field that still exists in the data file:
alter table mytable add columns (myint int);
select * from mytable;
+--------+----------+------------+-------+
| dummy1 | mystring | mydate | myint |
+--------+----------+------------+-------+
| (null) | Hello | 2017-05-30 | 1 |
+--------+----------+------------+-------+
Metastore
+-------------+-------------+-----------+
| column_name | integer_idx | type_name |
+-------------+-------------+-----------+
| dummy1 | 0 | int |
| mystring | 1 | string |
| mydate | 2 | date |
| myint | 3 | int |
+-------------+-------------+-----------+
avro-tools
(the data file itself was not changed, so its embedded schema is the same as the original one)
bash-4.1$ avro-tools getschema 000000_0
{
"type" : "record",
"name" : "mytable",
"namespace" : "local_db",
"fields" : [ {
"name" : "myint",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "mystring",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "mydate",
"type" : [ "null", {
"type" : "int",
"logicalType" : "date"
} ],
"default" : null
} ]
}
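To trigger the type-mismatch error mentioned earlier, it is enough to change a column's type in the Metastore so that it no longer matches the field's type in the file schema. A minimal sketch (depending on the Hive version and on hive.metastore.disallow.incompatible.col.type.changes, the ALTER itself may already be rejected; otherwise the subsequent SELECT fails with a SerDe type-mismatch error):
alter table mytable change mystring mystring int;
select * from mytable;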
Any work against the table is done based on the metadata stored in the Metastore.
When the table is queried, additional metadata is used: the schema stored in the data file.
The query result structure is constructed from the Metastore (see in the example above that 4 columns are returned after the table was altered).
The data returned depends on both schemas: a field with a given name in the file schema is mapped to the column with the same name in the Metastore schema.
If the names match but the data types don't, an error is raised.
A field in the data file that has no corresponding column in the Metastore is not presented.
A column in the Metastore with no corresponding field in the data file schema holds NULL values.