How to solve this HIVE_PARTITION_SCHEMA_MISMATCH?

不羁的心 submitted on 2020-01-06 07:01:45

Question


I have partitioned data in CSV files on S3:

  • s3://bucket/dataset/p=1/*.csv (partition #1)
  • ...
  • s3://bucket/dataset/p=100/*.csv (partition #100)

I run a classifier over s3://bucket/dataset/ and the result looks very promising: it detects 150 columns (c1,...,c150) and assigns various data types.

However, loading the resulting table in Athena and querying it (select * from dataset limit 10) yields the error message:

HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'c100' in table 'tests.dataset' is declared as type 'string', but partition 'AANtbd7L1ajIwMTkwOQ' declared column 'c100' as type 'boolean'.

First of all, I have no idea how to make use of 'AANtbd7L1ajIwMTkwOQ', but I can tell from the list of partitions in Glue that some partitions have c100 classified as string and some as boolean, while the table schema lists it as string.

That also means that if I restrict a query to a partition which classifies c100 as string, agreeing with the table schema, the query works. If I use a partition classifying c100 as boolean, the query fails with the above error message.

Now, from having a look at some of the CSVs, column c100 seems to contain three different values:

  • true
  • false
  • [empty] (like ...,,...)

Possibly some rows contain a typo, and hence some partitions are classified as string, but that is just a theory and difficult to verify due to the number and size of the files.
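One way to test the typo theory is to count every value in the column that is not one of the three expected ones. The following is a minimal sketch using only the Python standard library; the function name `unexpected_values`, the sample data, and the column index are hypothetical, and it assumes the files have no header row (a header cell would simply be reported as an unexpected value).

```python
import csv
import io
from collections import Counter

# The three values the question reports seeing in c100 ("" is the empty field).
EXPECTED = {"true", "false", ""}


def unexpected_values(lines, column_index, expected=EXPECTED):
    """Count values in one CSV column that are not in the expected set.

    `lines` is any iterable of text lines, e.g. an open file object.
    The comparison is exact apart from surrounding whitespace.
    """
    counts = Counter()
    for row in csv.reader(lines):
        if column_index < len(row):
            value = row[column_index].strip()
            if value not in expected:
                counts[value] += 1
    return counts


# Tiny in-memory sample standing in for one CSV file (made-up data).
sample = io.StringIO("true,1\n,2\nfalse,3\ntru,5\n")
print(unexpected_values(sample, column_index=0))
```

For the real data, the same function could be fed an open file per partition (c100 would be `column_index=99` if the columns are in order), which narrows the search to the partitions that report any unexpected value.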

I also tried MSCK REPAIR TABLE dataset, to no avail.

Is there a quick solution to this? Maybe forcing all partitions to use string? If I look at the list of partitions, there is a deactivated "edit schema" button.

Or do I have to write a Glue job that checks and discards or repairs every row?


Answer 1:


If you are using a crawler, you should select the following option:

Update all new and existing partitions with metadata from the table

You can do this while creating the table too. See https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent for more details.
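For crawlers managed through the API rather than the console, the same option is expressed in the crawler's `Configuration` JSON as the `InheritFromTable` behavior for partitions. The sketch below builds that JSON and, assuming boto3 is installed and AWS credentials are configured, applies it with `update_crawler`; the helper names and the crawler name passed in are hypothetical.

```python
import json


def inherit_from_table_configuration():
    """Return the crawler Configuration JSON that makes partitions
    inherit their metadata from the table."""
    config = {
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }
    return json.dumps(config)


def apply_to_crawler(crawler_name):
    """Apply the configuration to an existing crawler (needs boto3 and
    AWS credentials; crawler_name is whatever your crawler is called)."""
    import boto3  # third-party, imported lazily so the JSON helper works without it

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=inherit_from_table_configuration(),
    )


print(inherit_from_table_configuration())
```

The crawler has to be run again after the change for the new behavior to reach the partitions.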

This should solve the issue. If it doesn't, check the other options at https://github.com/awsdocs/amazon-athena-user-guide/blob/master/doc_source/glue-best-practices.md#schema-syncing

To understand the issue in Athena, see https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html
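If re-running the crawler is not an option, one alternative (not taken from the linked pages, so treat it as an assumption) is to overwrite each partition's column list with the table's column list through the Glue API, which is roughly what "forcing all partitions to use string" would mean. A rough sketch, with hypothetical helper names, assuming boto3 and permissions for `glue:GetTable`, `glue:GetPartitions` and `glue:UpdatePartition`:

```python
import copy


def partition_input_with_table_columns(partition, table_columns):
    """Build a PartitionInput dict that keeps the partition's storage
    settings but replaces its columns with the table's columns."""
    descriptor = copy.deepcopy(partition["StorageDescriptor"])
    descriptor["Columns"] = copy.deepcopy(table_columns)
    return {
        "Values": list(partition["Values"]),
        "StorageDescriptor": descriptor,
        "Parameters": dict(partition.get("Parameters", {})),
    }


def sync_partitions(database, table):
    """Copy the table schema onto every partition of database.table."""
    import boto3  # third-party, imported lazily

    glue = boto3.client("glue")
    table_columns = glue.get_table(DatabaseName=database, Name=table)[
        "Table"]["StorageDescriptor"]["Columns"]
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        for partition in page["Partitions"]:
            glue.update_partition(
                DatabaseName=database,
                TableName=table,
                PartitionValueList=partition["Values"],
                PartitionInput=partition_input_with_table_columns(
                    partition, table_columns),
            )
```

With the names from the error message this would be called as `sync_partitions("tests", "dataset")`. A later crawler run without the inherit option may reintroduce the mismatch.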



Source: https://stackoverflow.com/questions/57890280/how-to-solve-this-hive-partition-schema-mismatch
