Specify a SerDe serialization lib with AWS Glue Crawler

守給你的承諾、 提交于 2021-01-02 20:09:22

问题


Every time I run a glue crawler on existing data, it changes the Serde serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. for quoted fields with commas in)

I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.

I've tried making my own csv Classifier but that doesn't help.

How do I get the crawler to specify a particular serialization lib for the tables produced or updated?


回答1:


You can't specify the SerDe in the Glue Crawler at this time but here is a workaround...

  1. Create a Glue Crawler with the following configuration.

    Enable 'Add new columns only’ - This adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog

    Enable 'Update all new and existing partitions with metadata from the table’ - this option inherits metadata properties such as their classification, input format, output format, SerDe information, and schema from their parent table. Any changes to these properties in a table are propagated to its partitions.

  2. Run the crawler to create the table, it will create a table with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" - Edit this to the "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  3. Re-run the crawler.

  4. In case a new partition is added on crawler re-run, it will also be created with “org.apache.hadoop.hive.serde2.OpenCSVSerde”.

  5. You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and does not reset.



来源:https://stackoverflow.com/questions/57498330/specify-a-serde-serialization-lib-with-aws-glue-crawler

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!