aws-glue-data-catalog

How to access run-property of AWS Glue workflow in Glue job?

风流意气都作罢 submitted on 2021-01-28 11:15:02
Question: I have been working with AWS Glue workflows for orchestrating batch jobs. We need to pass a push-down predicate in order to limit the processing for the batch job. When we run Glue jobs alone, we can pass push-down predicates as command-line arguments at run time (i.e. aws glue start-job-run --job-name foo.scala --arguments --arg1-text ${arg1}..). But when we use a Glue workflow to execute Glue jobs, it is a bit unclear. When we orchestrate batch jobs using AWS Glue workflows, we can add run properties
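A minimal sketch of reading those run properties from inside a job, assuming the job is started by a Glue workflow (which passes WORKFLOW_NAME and WORKFLOW_RUN_ID as job arguments) and that a run property named pushdown_predicate was set on the workflow; the property name is only an example:

import sys
import boto3
from awsglue.utils import getResolvedOptions

# The workflow passes its name and run id to the job it triggers.
args = getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])

# Look up the run properties that were set on this workflow run.
glue = boto3.client('glue')
run_properties = glue.get_workflow_run_properties(
    Name=args['WORKFLOW_NAME'],
    RunId=args['WORKFLOW_RUN_ID']
)['RunProperties']

# 'pushdown_predicate' is a hypothetical property name set on the workflow.
push_down_predicate = run_properties.get('pushdown_predicate')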

Specify a SerDe serialization lib with AWS Glue Crawler

守給你的承諾、 submitted on 2021-01-02 20:09:22
Question: Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. for quoted fields containing commas). I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde. I've tried making my own CSV classifier but that doesn't help. How do I get the crawler to specify a particular serialization lib for the tables it produces or updates? Answer 1: You can't
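Since the crawler cannot be forced to pick a particular SerDe, one workaround is to patch the catalog table after each crawl with boto3. A hedged sketch, using placeholder database and table names ('my_db', 'my_table'); the exact set of read-only keys that must be stripped before update_table may vary:

import boto3

glue = boto3.client('glue')

table = glue.get_table(DatabaseName='my_db', Name='my_table')['Table']

# update_table rejects the read-only fields returned by get_table, so drop them.
read_only = ('DatabaseName', 'CreateTime', 'UpdateTime', 'CreatedBy',
             'IsRegisteredWithLakeFormation', 'CatalogId', 'VersionId')
table_input = {k: v for k, v in table.items() if k not in read_only}

# Point the table at OpenCSVSerde instead of LazySimpleSerDe.
table_input['StorageDescriptor']['SerdeInfo'] = {
    'SerializationLibrary': 'org.apache.hadoop.hive.serde2.OpenCSVSerde',
    'Parameters': {'separatorChar': ',', 'quoteChar': '"', 'escapeChar': '\\'},
}

glue.update_table(DatabaseName='my_db', TableInput=table_input)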

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

谁都会走 submitted on 2020-12-31 20:17:46
Question: Problem statement/root cause: We are using AWS Glue to load data from a production Postgres DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is, however, failing because Spark only supports lowercase table column names, and unfortunately all our source Postgres table column names are in CamelCase and enclosed in double quotes. E.g.: our source table column name in the Postgres DB is "CreatedDate". The Spark job query is looking for createddate and is
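One workaround, sketched below, is to bypass the catalog column names and read through a JDBC subquery that quotes the CamelCase columns on the Postgres side and aliases them to lowercase. The connection details and column names are placeholders, not the asker's actual values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Quote the CamelCase columns in the subquery and alias them to lowercase.
query = ('(SELECT "CreatedDate" AS createddate, "CustomerId" AS customerid '
         'FROM my_schema.my_table) AS src')

df = (spark.read.format('jdbc')
      .option('url', 'jdbc:postgresql://my-host:5432/my_db')
      .option('dbtable', query)
      .option('user', 'my_user')
      .option('password', 'my_password')
      .option('driver', 'org.postgresql.Driver')
      .load())

df.printSchema()  # columns arrive already lowercased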

How to access data in subdirectories for partitioned Athena table

懵懂的女人 submitted on 2020-12-12 18:50:12
Question: I have an Athena table with a partition for each day, where the actual files are in "sub-directories" by hour, as follows: s3://my-bucket/data/2019/06/27/00/00001.json s3://my-bucket/data/2019/06/27/00/00002.json s3://my-bucket/data/2019/06/27/01/00001.json s3://my-bucket/data/2019/06/27/01/00002.json Athena is able to query this table without issue and find my data, but AWS Glue does not appear to be able to find this data. ALTER TABLE mytable ADD PARTITION (year=2019, month
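One workaround, sketched below under the assumption that reading outside the catalog table is acceptable, is to point a DynamicFrame directly at the day-level prefix with the S3 'recurse' option enabled so the hourly subdirectories are included:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://my-bucket/data/2019/06/27/'],
        'recurse': True,  # descend into the per-hour subdirectories
    },
    format='json',
)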

AWS Glue job consuming data from external REST API

大城市里の小女人 submitted on 2020-12-06 07:30:11
Question: I'm trying to create a workflow where an AWS Glue ETL job will pull JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help! Answer 1: Yes, I do extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. Usually I use Python Shell jobs for the extraction because they are faster (relatively small cold start). When the extraction is finished, it triggers a Spark-type job that reads only the JSON items I need. I use
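A hedged sketch of that pattern: a Python Shell job pulls JSON from the external API and stages it in S3 for the downstream Spark job to read. The endpoint, bucket and key are placeholders, and the requests library may need to be supplied to the Python Shell job as an extra library:

import json

import boto3
import requests

# Pull the payload from the external REST API (placeholder endpoint).
response = requests.get('https://api.example.com/v1/items', timeout=30)
response.raise_for_status()
items = response.json()

# Stage the raw JSON in S3; the Spark job reads only these staged objects.
s3 = boto3.client('s3')
s3.put_object(
    Bucket='my-staging-bucket',
    Key='rest-api/items/2020-12-06/items.json',
    Body=json.dumps(items).encode('utf-8'),
)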

DynamicFrame resolveChoice specs, date cast

丶灬走出姿态 submitted on 2020-08-26 13:44:51
Question: I am writing Glue code and using the DynamicFrame resolveChoice API with specs. I am trying to cast the source by passing casts when the dynamic frame is created from the catalog. I have successfully implemented the casting via resolveChoice specs, but while casting the date column I am getting null values; I just wanted to understand how we can pass a date with its source format in the cast. self.df_TR01=self.df_TR01.resolveChoice(specs=[('col1', 'cast:string'), ('col2_date', 'cast:date')]).toDF() But in col2_date I
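One workaround, sketched below, is to cast the date column to string in resolveChoice and then parse it with an explicit pattern in Spark; 'dd/MM/yyyy' is only an example of a source format, and df_TR01 stands in for the dynamic frame created from the catalog:

from pyspark.sql import functions as F

# Keep the raw string through resolveChoice, then apply the actual source format.
df = df_TR01.resolveChoice(
    specs=[('col1', 'cast:string'), ('col2_date', 'cast:string')]
).toDF()

df = df.withColumn('col2_date', F.to_date(F.col('col2_date'), 'dd/MM/yyyy'))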