There are at least two different ways of creating a Hive table backed with Avro data:
1) Creating the table based on an Avro schema (in this example a schema file stored in HDFS, referenced through the avro.schema.url table property)
2) Creating the table from explicit column definitions and letting Hive generate the Avro schema, e.g. with a CTAS statement
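For the first use-case, a minimal sketch (the table name and the HDFS path of the .avsc file are illustrative; avro.schema.url is the standard AvroSerDe table property):
create external table mytable_from_schema
stored as avro
tblproperties ('avro.schema.url'='hdfs:///user/me/schemas/mytable.avsc');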
The following refers to the second use-case, where no schema file is involved.
The schema is stored in 2 places:
1. The Metastore
2. As part of the data files
All the information for the DESC/SHOW commands is taken from the Metastore.
Every DDL change impacts only the Metastore.
When you query the data, the two schemas are matched by column name.
If the column types do not match, you'll get an error.
create table mytable
stored as avro
as
select 1 as myint
,'Hello' as mystring
,current_date as mydate
;
select * from mytable
;
+-------+----------+------------+
| myint | mystring | mydate |
+-------+----------+------------+
| 1 | Hello | 2017-05-30 |
+-------+----------+------------+
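Since DESC/SHOW read from the Metastore, a plain describe already shows the same three columns and types that the Metastore query below returns (output trimmed; the exact shape depends on the client):
describe mytable;
+-----------+-----------+---------+
| col_name  | data_type | comment |
+-----------+-----------+---------+
| myint     | int       |         |
| mystring  | string    |         |
| mydate    | date      |         |
+-----------+-----------+---------+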
Metastore
select c.column_name
,c.integer_idx
,c.type_name
from metastore.DBS as d
join metastore.TBLS as t on t.db_id = d.db_id
join metastore.SDS as s on s.sd_id = t.sd_id
join metastore.COLUMNS_V2 as c on c.cd_id = s.cd_id
where d.name = 'local_db'
and t.tbl_name = 'mytable'
order by integer_idx;
+-------------+-------------+-----------+
| column_name | integer_idx | type_name |
+-------------+-------------+-----------+
| myint | 0 | int |
| mystring | 1 | string |
| mydate | 2 | date |
+-------------+-------------+-----------+
avro-tools
(run against the table's data file, 000000_0, taken from the table's directory in HDFS)
bash-4.1$ avro-tools getschema 000000_0
{
"type" : "record",
"name" : "mytable",
"namespace" : "local_db",
"fields" : [ {
"name" : "myint",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "mystring",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "mydate",
"type" : [ "null", {
"type" : "int",
"logicalType" : "date"
} ],
"default" : null
} ]
}
Renaming a column is a DDL change, so it updates only the Metastore:
alter table mytable change myint dummy1 int;
select * from mytable;
+--------+----------+------------+
| dummy1 | mystring | mydate |
+--------+----------+------------+
| (null) | Hello | 2017-05-30 |
+--------+----------+------------+
Adding a column named myint re-creates the mapping to the myint field that still exists in the data file:
alter table mytable add columns (myint int);
select * from mytable;
+--------+----------+------------+-------+
| dummy1 | mystring | mydate | myint |
+--------+----------+------------+-------+
| (null) | Hello | 2017-05-30 | 1 |
+--------+----------+------------+-------+
Metastore
+-------------+-------------+-----------+
| column_name | integer_idx | type_name |
+-------------+-------------+-----------+
| dummy1 | 0 | int |
| mystring | 1 | string |
| mydate | 2 | date |
| myint | 3 | int |
+-------------+-------------+-----------+
avro-tools
(the data file itself was not changed, so its embedded schema is the same as the original one)
bash-4.1$ avro-tools getschema 000000_0
{
"type" : "record",
"name" : "mytable",
"namespace" : "local_db",
"fields" : [ {
"name" : "myint",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "mystring",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "mydate",
"type" : [ "null", {
"type" : "int",
"logicalType" : "date"
} ],
"default" : null
} ]
}
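To trigger the type-mismatch error mentioned earlier, it is enough to change a column's type in the Metastore so that it no longer matches the field's type in the file schema. A minimal sketch (depending on the Hive version and on hive.metastore.disallow.incompatible.col.type.changes, the ALTER itself may already be rejected; otherwise the subsequent SELECT fails with a SerDe type-mismatch error):
alter table mytable change mystring mystring int;
select * from mytable;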
Any work against the table is done based on the metadata stored in the Metastore.
When the table is queried, additional metadata is used: the schema stored in the data file.
The query result structure is constructed from the Metastore (see in the example above that 4 columns are returned after the table was altered).
The data returned depends on both schemas: a field with a given name in the file schema is mapped to the column with the same name in the Metastore schema.
If the names match but the data types don't, an error is raised.
A field in the data file that has no corresponding column in the Metastore is not presented.
A column in the Metastore with no corresponding field in the data file schema holds NULL values.