Athena puts data in incorrect columns when input data format changes

鱼传尺愫 2021-01-27 14:42

We have some pipe-delimited .txt reports coming into a folder in S3, on which we run a Glue crawler to determine the schema so that we can query them in Athena.

The format of the report c

1 Answer
  • 2021-01-27 15:19

    I think this is yet another case of Glue overpromising and underdelivering. As long as the data format is delimited text, Glue will do the wrong thing if you add columns in the middle. Delimited text files carry no column names, so values are matched to the table's columns by position. Adding or removing (but not both) columns at the end works, but not in the middle. Athena does not support different columns for different partitions, so there is no way that Glue could make this work, but it makes it look like it can.
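To illustrate why a column inserted in the middle breaks things, here is a minimal Python sketch that imitates positional column matching. It is a simulation, not Athena's or Glue's actual code, and the schema (`id`, `name`, `amount`), the added `currency` value, and the sample rows are all hypothetical.

```python
def read_by_position(line, schema, delimiter="|"):
    """Map delimited values onto column names purely by position."""
    values = line.split(delimiter)
    # Extra trailing values are dropped; missing trailing values become None.
    values = values[:len(schema)] + [None] * (len(schema) - len(values))
    return dict(zip(schema, values))


# Hypothetical schema inferred from the original report layout.
table_schema = ["id", "name", "amount"]

old_row = "1|alice|9.50"
row_with_column_at_end = "1|alice|9.50|EUR"
row_with_column_in_middle = "1|alice|EUR|9.50"

old_result = read_by_position(old_row, table_schema)
end_result = read_by_position(row_with_column_at_end, table_schema)
middle_result = read_by_position(row_with_column_in_middle, table_schema)

print(old_result)
print(end_result)
print(middle_result)
```

With the extra value at the end, the existing columns still line up. With it in the middle, the `amount` column silently receives the currency code.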

    You will either have to rewrite the data, change the reports so that new columns are always added last, or switch to a different data format where the files contain enough metadata for this not to be a problem: JSON, Avro, or Parquet.
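As a rough sketch of the "rewrite the data" option, the following Python converts pipe-delimited text into JSON Lines, so that every record carries its own column names. It assumes each report starts with a header row, which the question does not confirm. The function name `pipe_delimited_to_json_lines` and the sample columns are made up for illustration.

```python
import csv
import io
import json


def pipe_delimited_to_json_lines(text):
    """Convert pipe-delimited text with a header row into JSON Lines."""
    # DictReader takes the column names from the first line (assumed header).
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical reports: the second has a column inserted in the middle.
old_report = "id|name|amount\n1|alice|9.50\n"
new_report = "id|name|currency|amount\n1|alice|EUR|9.50\n"

old_json = pipe_delimited_to_json_lines(old_report)
new_json = pipe_delimited_to_json_lines(new_report)

print(old_json)
print(new_json)
```

Because each value is keyed by name, `amount` stays `amount` regardless of where the new column was inserted.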

    I would suggest you stop using Glue crawlers altogether. They look like a general-purpose tool, but really solve only a few use cases. See https://stackoverflow.com/a/56439429/1109 for some suggestions on what to do instead.
