Where an Avro schema is stored when I create a hive table with 'STORED AS AVRO' clause?

前端 未结 2 671
难免孤独
难免孤独 2021-02-04 09:26

There are at least two different ways of creating a hive table backed with Avro data:

1) Creating a table based on an Avro schema (in this example stored in hdfs):

2条回答
  •  抹茶落季
    2021-02-04 10:23

    I decided to publish a complementary answer to those given by @DuduMarkovitz.

    To make code examples more concise let's clarify that STORED AS AVRO clause is an equivalent of these three lines:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    

    Let's take a look then at what happens when we create a table giving a reference to avro schema stored in hdfs. Here is the schema:

    {
      "namespace": "io.sqooba",
      "name": "user",
      "type": "record",
      "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"}
      ]
    }
    

    We create our table with the following command:

    CREATE TABLE users_from_avro_schema
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.url'='hdfs:///user/tulinski/user.avsc');
    

    Hive has inferred the schema properly, which we can see by calling:

    hive> DESCRIBE users_from_avro_schema;
    OK
    id                      int
    name                    string
    

    Hive Metastore shows us the same (I use @DuduMarkovitz's query):

    +------------------------+-------------+-------------+-----------+
    | tbl_name               | column_name | integer_idx | type_name |
    +------------------------+-------------+-------------+-----------+
    | users_from_avro_schema | id          |           0 | int       |
    | users_from_avro_schema | name        |           1 | string    |
    +------------------------+-------------+-------------+-----------+
    

    So far, so good, everything works as we expect. But let's see what happens when we update avro.schema.url property to point to the next version of our schema (users_v2.avsc), which is as follows:

    {
      "namespace": "io.sqooba",
      "name": "user",
      "type": "record",
      "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default":null}
      ]
    }
    

    We simply added another field called email.
    Now we update a table property pointing to the avro schema in hdfs:

    ALTER TABLE users_from_avro_schema SET TBLPROPERTIES('avro.schema.url'='hdfs:///user/tulinski/user_v2.avsc');
    

    Has table metadata been changed?

    hive> DESCRIBE users_from_avro_schema;
    OK
    id                      int
    name                    string
    email                   string
    

    Yeah, cool! But do you expect that Hive Metastore contains this additional column?
    Unfortunately in Metastore nothing changed:

    +------------------------+-------------+-------------+-----------+
    | tbl_name               | column_name | integer_idx | type_name |
    +------------------------+-------------+-------------+-----------+
    | users_from_avro_schema | id          |           0 | int       |
    | users_from_avro_schema | name        |           1 | string    |
    +------------------------+-------------+-------------+-----------+
    

    I suspect that Hive has the following strategy of inferring schema: It tries to get it from a SerDe class specified for a given table. When SerDe cannot provide the schema Hive looks into the metastore.
    Let's check that by removing avro.schema.url property:

    hive> ALTER TABLE users_from_avro_schema UNSET TBLPROPERTIES ('avro.schema.url');
    OK
    Time taken: 0.33 seconds
    hive> DESCRIBE users_from_avro_schema;
    OK
    id                      int
    name                    string
    Time taken: 0.363 seconds, Fetched: 2 row(s)
    

    Describe shows us data stored in the Metastore. Let's modify them by adding a column:

    ALTER TABLE users_from_avro_schema ADD COLUMNS (phone string);
    

    It of course changes Hive Metastore:

    +------------------------+-------------+-------------+-----------+
    | tbl_name               | column_name | integer_idx | type_name |
    +------------------------+-------------+-------------+-----------+
    | users_from_avro_schema | id          |           0 | int       |
    | users_from_avro_schema | name        |           1 | string    |
    | users_from_avro_schema | phone       |           2 | string    |
    +------------------------+-------------+-------------+-----------+
    

    But when we set avro.schema.url again back to user_v2.avsc what is in Hive Metastore doesn't matter any more:

    hive> ALTER TABLE users_from_avro_schema SET TBLPROPERTIES('avro.schema.url'='hdfs:///user/tulinski/user_v2.avsc');
    OK
    Time taken: 0.268 seconds
    hive> DESCRIBE users_from_avro_schema;
    OK
    id                      int
    name                    string
    email                   string
    

    Avro schema takes precedence over the Metastore.

    The above example shows that we should rather avoid mixing hive schema changes with avro schema evolution, because otherwise we can easily get into big mess and inconsistency between Hive Metastore and actual schema which is used while reading and writing data. The first inconsistency occurs when we change our avro schema definition by updating avro.schema.url property, but we can live with that if we are aware of Hive strategy of inferring schema. I haven't checked in Hive's source code whether my suspicions about schema logic are correct, but the above example convince me what happens underneath.

    I extended my answer to show that even when there is a conflict between Avro schema and Hive Metastore data which comply Avro schema can be read. Please have a look again at my example above. Our table definition points to avro schema having three fields:

    id    int
    name  string
    email string
    

    whereas in Hive Metastore there are the following columns:

    id    int
    name  string
    phone string
    

    email vs phone
    Let's create an avro file containing a single user record complying user_v2.avsc schema. This is its json representation:

    {
      "id": 123,
      "name": "Tomek",
      "email": {"string": "tomek@tomek"}
    }
    

    To create the avro file we call:

    java -jar avro-tools-1.8.2.jar fromjson --schema-file user_v2.avsc user_tomek_v2.json > user_tomek_v2.avro
    

    We are able to query our table despite the fact that Hive Metastore doesn't contain email column and it contains phone column instead:

    hive> set hive.cli.print.header=true;
    hive> select * from users_from_avro_schema;
    OK
    users_from_avro_schema.id   users_from_avro_schema.name users_from_avro_schema.email
    123 Tomek   tomek@tomek
    

提交回复
热议问题