AWS Glue Spark Job Fails to Support Uppercase Column Names with Double Quotes

Submitted by 拥有回忆 on 2020-12-31 20:07:44

Question


Problem Statement/Root Cause: We are using AWS Glue to load data from a production PostgreSQL DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is failing, however, because Spark by default only resolves lowercase table column names, and unfortunately all of our source PostgreSQL table column names are in CamelCase and enclosed in double quotes.

E.g.: our source table column name in the PostgreSQL DB is "CreatedDate". The Spark job's query looks for createddate and fails because it can't find that column. So the Spark query needs to reference exactly "CreatedDate" to be able to move data from the PostgreSQL DB. This seems to be an inherent limitation of both Spark (which folds column names to lowercase by default) and PostgreSQL (column names created with double quotes have to be double-quoted for the rest of their life).
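To make the PostgreSQL side of this concrete: what matters is that the identifier stays quoted in the SQL that actually reaches the database. A minimal sketch, assuming an existing SparkSession named spark; the connection details and source_table are placeholders, not taken from the original job:

# Wrapping the projection in a subquery pushes the quoted identifier down
# to PostgreSQL, so it matches "CreatedDate" instead of the folded createddate.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/proddb")
      .option("dbtable", '(SELECT "CreatedDate" FROM source_table) AS t')
      .option("user", "etl_user")
      .option("password", "...")
      .load())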

Reference links:

https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html
Are PostgreSQL column names case-sensitive?

Solutions evaluated:

  1. We cannot rename the column names from CamelCase to lowercase, as that would force a much bigger change in all downstream systems.
  2. We are trying to rewrite/tweak Glue's auto-generated Spark code to see if we can get it to work with double-quoted, non-lowercase source table column names.

Has anyone run into this issue before, and have you tried tweaking the auto-generated Spark code to get it working?


Answer 1:


Solution 1: If you are using Scala and a Glue DynamicFrame, you can use applyMapping(). The default value for caseSensitive is true. See https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping
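For comparison, the Python ApplyMapping transform takes (source column, source type, target column, target type) tuples, and keeping the exact "CreatedDate" spelling in the source slot is what case-sensitive matching is meant to preserve. A rough sketch; datasource0 and the column types are placeholders:

from awsglue.transforms import ApplyMapping

# Map the CamelCase source column through without renaming it.
mapped = ApplyMapping.apply(
    frame=datasource0,
    mappings=[("CreatedDate", "timestamp", "CreatedDate", "timestamp")],
    transformation_ctx="mapped",
)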

Solution 2: If you are using a PySpark DataFrame in Python, you can set the conf:

spark_session.sql('set spark.sql.caseSensitive=true')
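A quick way to see what the flag changes, using throwaway data rather than the real source:

# With case-sensitive analysis on, Spark stops folding identifiers,
# so only the exact spelling resolves.
spark_session.conf.set("spark.sql.caseSensitive", "true")

df = spark_session.createDataFrame([(1,)], ["CreatedDate"])
df.select("CreatedDate").show()   # resolves
# df.select("createddate")        # would now fail with an unresolved-column error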



Answer 2:


Sandeep Fatangare, thanks for your suggestion.

I am very new to AWS Glue, so I don't know whether I'm doing this correctly. Please guide me if I'm wrong.

I tried editing the script by navigating to

AWS Glue -> Jobs and choosing the failed job's script.

In the Details tab, the location shown in the job details is s3://aws-glue-assets-us-east-1/scripts/glueetl/jdbc_incremental.py.

And in the Script tab, I started editing the script.

Previous:

applymapping1 = ApplyMapping.apply(frame=datasource0, mappings=self.get_mappings(),
                                   transformation_ctx="applymapping1_" + self.source.table_name)

Edited:

applymapping1 = ApplyMapping.apply(frame=datasource0, mappings=self.get_mappings(),
                                   caseSensitive : Boolean = false,
                                   transformation_ctx="applymapping1_" + self.source.table_name)

And I faced two problems:

  1. I can't save the edited script.
  2. And while running the script, it tells me the workflow name is missing.
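As a side note, the caseSensitive : Boolean = false fragment above is Scala syntax and will not parse in a Python script. Assuming the Glue version in use exposes this flag in Python (newer aws-glue-libs document a case_sensitive argument on DynamicFrame.apply_mapping; treat that as an assumption here), the edit would look roughly like this:

# Assumption: this Glue version's DynamicFrame supports case_sensitive.
applymapping1 = datasource0.apply_mapping(
    self.get_mappings(),
    case_sensitive=True,  # resolve "CreatedDate" exactly instead of createddate
    transformation_ctx="applymapping1_" + self.source.table_name,
)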


Source: https://stackoverflow.com/questions/58093109/aws-glue-spark-job-fails-to-support-upper-case-column-name-with-double-quotes
