Specify a SerDe serialization lib with AWS Glue Crawler

Question

Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't parse the data correctly (e.g. for quoted fields that contain commas).
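To illustrate the problem, here is a small Python sketch. The sample row is made up, and plain splitting on commas only approximates what a delimiter-only parser such as LazySimpleSerDe does with it; the `csv` module stands in for a quote-aware parser such as OpenCSVSerde.

```python
import csv
import io

# Hypothetical sample row: the second field is quoted and contains a comma.
line = '1,"Smith, John",London'

# Delimiter-only parsing: every comma starts a new field, quotes are ignored.
naive = line.split(",")

# Quote-aware parsing: the quoted comma stays inside its field.
quoted = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)    # 4 ['1', '"Smith', ' John"', 'London']
print(len(quoted), quoted)  # 3 ['1', 'Smith, John', 'London']
```

The delimiter-only result has one column too many, so every value after the quoted field lands in the wrong column.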

I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.

I've tried making my own CSV classifier, but that doesn't help.

How do I get the crawler to specify a particular serialization lib for the tables produced or updated?

Answer 1:

You can't specify the SerDe in the Glue crawler at this time, but here is a workaround:

1. Create a Glue crawler with the following configuration:

   Enable 'Add new columns only'. This adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.

   Enable 'Update all new and existing partitions with metadata from the table'. With this option, partitions inherit metadata properties such as classification, input format, output format, SerDe information, and schema from their parent table. Any changes to these properties in a table are propagated to its partitions.

2. Run the crawler to create the table. The table is created with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; edit this to "org.apache.hadoop.hive.serde2.OpenCSVSerde".

3. Re-run the crawler.

4. If a new partition is added on the re-run, it is also created with "org.apache.hadoop.hive.serde2.OpenCSVSerde".

You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and does not reset.
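If you would rather script these steps than click through the console, the following is a minimal sketch using boto3. Treat it as an illustration under assumptions, not a tested recipe: the function names are made up, the OpenCSVSerde parameters (`separatorChar`, `quoteChar`, `escapeChar`) are set to typical values you may need to change, and the sketch assumes that the `MergeNewColumns` and `InheritFromTable` settings in the crawler's `Configuration` JSON correspond to the two console options above.

```python
import copy
import json

OPEN_CSV_SERDE = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

# get_table returns read-only fields (CreateTime, DatabaseName, ...) that
# update_table rejects, so only these TableInput keys are copied over.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def crawler_configuration() -> str:
    """Crawler Configuration JSON assumed to match the two console options."""
    return json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            # 'Add new columns only'
            "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
            # 'Update all new and existing partitions with metadata from the table'
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        },
    })


def with_open_csv_serde(table: dict) -> dict:
    """Build a TableInput from a get_table result with the SerDe swapped."""
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in TABLE_INPUT_KEYS
    }
    serde = table_input["StorageDescriptor"].setdefault("SerdeInfo", {})
    serde["SerializationLibrary"] = OPEN_CSV_SERDE
    # Assumed parameter values; adjust to your files.
    serde["Parameters"] = {"separatorChar": ",", "quoteChar": '"', "escapeChar": "\\"}
    return table_input


def switch_table_serde(database: str, table_name: str) -> None:
    """Replace the manual edit in step 2: rewrite the table's SerDe in place."""
    import boto3  # imported here so the helpers above work without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database, TableInput=with_open_csv_serde(table))
```

The string returned by `crawler_configuration()` would be passed as the `Configuration` argument of `create_crawler` or `update_crawler`. After the first crawler run, calling `switch_table_serde("my_database", "my_table")` (both names are placeholders) stands in for editing the table by hand, and later crawler runs should then leave the SerDe alone as described above.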

Source: https://stackoverflow.com/questions/57498330/specify-a-serde-serialization-lib-with-aws-glue-crawler

Tags

amazon-web-services

amazon-athena

aws-glue

aws-glue-data-catalog