SPARK-HIVE-key differences between Hive and Parquet from the perspective of table schema processing

后端 未结 1 466
眼角桃花
眼角桃花 2021-01-16 02:40

I am new in spark and hive. I do not understand the statement

\"Hive considers all columns nullable, while nullability in Parquet is significant\"

If any o

1条回答
  •  一向
    一向 (楼主)
    2021-01-16 02:56

    In standard SQL syntax, when you create a table, you can state that a specific column is "nullable" (i.e. may contain a Null value) or not (i.e. trying to insert/update a Null value will throw an error).
    Nullable is the default.

    Parquet schema syntax supports the same concept, although when using AVRO serialization, not-nullable is the default.
    Caveat -- when you use Spark to read multiple Parquet files, these files may have different schemas. Imagine that the schema definition has changed over time, and newer files have 2 more Nullable columns at the end. Then you have to request "schema merging" so that Spark reads the schema from all files (not just one at random) to make sure that all these schemas are compatible, then at read-time the "undefined" columns are defaulted to Null for older files.

    Hive HQL syntax does not support the standard SQL feature; every column is, and must be, nullable -- simply because Hive does not have total control on its data files!
    Imagine a Hive partitioned table with 2 partitions...

    • one partition uses TextFile format and contains CSV dumps from different sources, some showing up all expected columns, some missing the last 2 columns because they use an older definition
    • the second partition uses Parquet format for history, created by Hive INSERT-SELECT queries, but older Parquet files are missing the last 2 columns also, because they were created using the older table definition

    For the Parquet-based partition, Hive does "schema merging", but instead of merging the file schemas together (like Spark), it merges each file schema with the table schema -- ignoring columns that are not defined in the table, and defaulting to Null all table columns that are not in the file.

    Note that for the CSV-based partition, it's much more brutal, because the CSV files don't have a "schema" -- they just have a list of values that are mapped to the table columns, in order. On reaching EOL all missing columns are set to Null; on reaching the value for the last column, any extra value on the line is ignored.

    0 讨论(0)
提交回复
热议问题