AWS Glue: Rename_field() does not work after relationalize

微笑、不失礼 提交于 2021-02-10 15:00:58

问题


I got a job that needs to perform the following task

  1. Relationalize the data
  2. Rename the field names that contains the '.'s so that it can be imported into PostgreSQL as normal looking field name.

Here is the code

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "gluecatalog", table_name = "fcorders", transformation_ctx = "datasource0")
rootTableName = 'orders' 

dfc = Relationalize.apply(frame = datasource0, staging_path = "s3://my-bucket/temp/", name = rootTableName, transformation_ctx = "dfc")
dfc.keys()
for df_name in dfc.keys():
        m_df = dfc.select(df_name)
        print "Writing to Postgre table: ", df_name
        if (df_name <> rootTableName):
            renamefields4 = m_df.rename_field("SalesDeliveryLines.val.shipped.unitDisplayCode", "shipped_unitDisplayCode")
        else:
            renamefields4 = RenameField.apply(frame = m_df, old_name = "vehicle.sourceReccordUID", new_name = "vehicle_sourceReccordUID", transformation_ctx = "renamefields4")
        renamefields4.printSchema()

The printSchema() displays the schema as unchanged. If I write to the database, the field names still contain '.'s.

If I uses ApplyMapping.apply() to change the field name before relationalize, it makes the child table disappear. If I use ApplyMapping.apply() after relationalize it simply deletes all the fields whose name contains '.'.

The bottom line is I cannot relationalize and rename field in the same job no matter what I try.

Did I miss something or is this a AWS Glue bug?


回答1:


It is confirmed the malfunction of rename_field() and RenameField.apply() is a Glue bug.

The work-around I have so far is to convert DynamicFrame to DataFrame -> rename the fields DataFrame -> Convert it back to DynamicFrame.

Here is the code

    new_df = m_df.toDF()
    print (type( new_df))
    for oldName in new_df.schema.names:
      new_df = new_df.withColumnRenamed(oldName, oldName.replace("SalesDeliveryLines.val.","").replace(".","_"))
    m_df = m_df.fromDF(new_df, glueContext, "m_df")



回答2:


You need to place back-ticks around the field name:

m_df.rename_field("`SalesDeliveryLines.val.shipped.unitDisplayCode`", "shipped_unitDisplayCode")

You could find more info on the AWS Glue documentation



来源:https://stackoverflow.com/questions/45970673/aws-glue-rename-field-does-not-work-after-relationalize

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!