Question
I have partitioned data in CSV files on S3:
- s3://bucket/dataset/p=1/*.csv (partition #1)
- ...
- s3://bucket/dataset/p=100/*.csv (partition #100)
I run a classifier over s3://bucket/dataset/ and the result looks very promising: it detects 150 columns (c1,...,c150) and assigns various data types.
Loading the resulting table in Athena and querying it (select * from dataset limit 10), however, yields the error message:
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'c100' in table 'tests.dataset' is declared as type 'string', but partition 'AANtbd7L1ajIwMTkwOQ' declared column 'c100' as type 'boolean'.
First of all, I have no idea how to make use of 'AANtbd7L1ajIwMTkwOQ' ... but I can tell from the list of partitions in Glue that some partitions have c100 classified as string and some as boolean, while the table schema lists it as string.
That also means that if I restrict a query to a partition that classifies c100 as string, in agreement with the table schema, the query works. If I use a partition that classifies c100 as boolean, the query fails with the error message above.
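The comparison described above (table-level type versus per-partition type) can be done programmatically rather than by reading the Glue console. The following is a minimal sketch, not a definitive implementation: it assumes the dictionaries have the shape returned by boto3's Glue `get_table` and `get_partitions` calls, and the helper names `find_mismatched_partitions` and `load_from_glue` are hypothetical.

```python
def find_mismatched_partitions(table_columns, partitions, column_name):
    """Return (partition_values, partition_type) for every partition whose
    type for column_name differs from the table-level type."""
    table_types = {c["Name"]: c["Type"] for c in table_columns}
    expected = table_types[column_name]
    mismatches = []
    for part in partitions:
        part_columns = part["StorageDescriptor"]["Columns"]
        part_types = {c["Name"]: c["Type"] for c in part_columns}
        actual = part_types.get(column_name)
        # Skip partitions that do not declare the column at all.
        if actual is not None and actual != expected:
            mismatches.append((part["Values"], actual))
    return mismatches


def load_from_glue(database, table):
    """Fetch the table columns and all partitions from the Glue Data Catalog.
    Requires boto3 and AWS credentials (assumption: default credential chain)."""
    import boto3  # imported lazily so the pure function above works offline

    glue = boto3.client("glue")
    tbl = glue.get_table(DatabaseName=database, Name=table)["Table"]
    partitions = []
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        partitions.extend(page["Partitions"])
    return tbl["StorageDescriptor"]["Columns"], partitions
```

With the names from the error message, a call would look like `columns, parts = load_from_glue("tests", "dataset")` followed by `find_mismatched_partitions(columns, parts, "c100")`, which lists the partition values (for example the value of `p`) whose schema disagrees with the table.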
From a look at some of the CSVs, column c100 seems to contain three different values:
- true
- false
- [empty] (like ...,,...)
Possibly some row contains a typo, and hence some partitions classify the column as string, but that is just a theory and is difficult to verify due to the number and size of the files.
I also tried MSCK REPAIR TABLE dataset, to no avail.
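One way to test the typo theory on a single file is to tally the distinct values in that column and report anything other than true, false, or empty. This is only a sketch under assumptions not stated in the question: the files are comma-delimited, have no header row, and c100 is the 100th field (index 99). The function names are hypothetical.

```python
import csv
from collections import Counter


def count_column_values(lines, column_index):
    """Tally the (lower-cased, stripped) values found at column_index.
    `lines` is any iterable of CSV text lines, e.g. an open file."""
    counts = Counter()
    for row in csv.reader(lines):
        if column_index < len(row):
            counts[row[column_index].strip().lower()] += 1
        else:
            # Row is too short (or blank), so the column is absent.
            counts["<missing>"] += 1
    return counts


def unexpected_values(counts, allowed=("true", "false", "")):
    """Return only the values that are not in the allowed set."""
    return {value: n for value, n in counts.items() if value not in allowed}
```

For a locally downloaded file this could be used as `with open(path, newline="") as f: print(unexpected_values(count_column_values(f, 99)))`. Running it against one file from a partition classified as string and one from a partition classified as boolean should show whether stray values explain the difference.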
Is there a quick solution to this? Maybe forcing all partitions to use string? In the list of partitions there is a deactivated "edit schema" button.
Or do I have to write a Glue job that checks and discards or repairs every row?
Answer 1:
If you are using a crawler, you should select the following option:
Update all new and existing partitions with metadata from the table
You can also do this when creating the table. See https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent for more details.
This should solve the issue. If it doesn't, check the other options at https://github.com/awsdocs/amazon-athena-user-guide/blob/master/doc_source/glue-best-practices.md#schema-syncing
To understand the issue in Athena, see https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html
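The console option named above corresponds to the `InheritFromTable` partition behavior in the crawler's JSON configuration, as described in the first linked Glue page. A minimal sketch of setting it through boto3 follows; the crawler name is a placeholder, the helper names are hypothetical, and the partitions are expected to pick up the table schema only once the crawler runs again.

```python
import json


def inherit_from_table_configuration():
    """Build the crawler configuration JSON that makes partitions
    inherit their metadata (including column types) from the table."""
    config = {
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }
    return json.dumps(config)


def apply_to_crawler(crawler_name):
    """Apply the configuration to an existing crawler.
    Requires boto3 and AWS credentials (assumption: default credential chain)."""
    import boto3  # imported lazily so the builder above works offline

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=inherit_from_table_configuration(),
    )
```

For example, `apply_to_crawler("my-dataset-crawler")` (a made-up name) followed by a new crawler run should leave every partition with c100 typed as string, matching the table.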
Source: https://stackoverflow.com/questions/57890280/how-to-solve-this-hive-partition-schema-mismatch