aws-glue-data-catalog

How to access run-property of AWS Glue workflow in Glue job?

风流意气都作罢 submitted on 2021-01-28 11:15:02
Question: I have been working with AWS Glue workflows for orchestrating batch jobs. We need to pass a push-down predicate in order to limit the processing for the batch job. When we run Glue jobs alone, we can pass push-down predicates as command-line arguments at run time (i.e. aws glue start-job-run --job-name foo.scala --arguments --arg1-text ${arg1}..). But when we use a Glue workflow to execute Glue jobs, it is a bit unclear. When we orchestrate batch jobs using AWS Glue workflows, we can add run properties
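A minimal sketch of reading those run properties from inside a job, assuming the job is started by a Glue workflow (which passes WORKFLOW_NAME and WORKFLOW_RUN_ID as job arguments) and that a run property named pushdown_predicate was set on the workflow; the property name is only an example:

import sys
import boto3
from awsglue.utils import getResolvedOptions

# The workflow passes its name and run id to the job it triggers.
args = getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])

# Look up the run properties that were set on this workflow run.
glue = boto3.client('glue')
run_properties = glue.get_workflow_run_properties(
    Name=args['WORKFLOW_NAME'],
    RunId=args['WORKFLOW_RUN_ID']
)['RunProperties']

# 'pushdown_predicate' is a hypothetical property name set on the workflow.
push_down_predicate = run_properties.get('pushdown_predicate')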

Specify a SerDe serialization lib with AWS Glue Crawler

守給你的承諾、 submitted on 2021-01-02 20:09:22
Question: Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. for quoted fields containing commas). I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde. I've tried making my own CSV classifier but that doesn't help. How do I get the crawler to specify a particular serialization lib for the tables it produces or updates? Answer 1: You can't
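Since the crawler cannot be forced to pick a particular SerDe, one workaround is to patch the catalog table after each crawl with boto3. A hedged sketch, using placeholder database and table names ('my_db', 'my_table'); the exact set of read-only keys that must be stripped before update_table may vary:

import boto3

glue = boto3.client('glue')

table = glue.get_table(DatabaseName='my_db', Name='my_table')['Table']

# update_table rejects the read-only fields returned by get_table, so drop them.
read_only = ('DatabaseName', 'CreateTime', 'UpdateTime', 'CreatedBy',
             'IsRegisteredWithLakeFormation', 'CatalogId', 'VersionId')
table_input = {k: v for k, v in table.items() if k not in read_only}

# Point the table at OpenCSVSerde instead of LazySimpleSerDe.
table_input['StorageDescriptor']['SerdeInfo'] = {
    'SerializationLibrary': 'org.apache.hadoop.hive.serde2.OpenCSVSerde',
    'Parameters': {'separatorChar': ',', 'quoteChar': '"', 'escapeChar': '\\'},
}

glue.update_table(DatabaseName='my_db', TableInput=table_input)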

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

谁都会走 submitted on 2020-12-31 20:17:46
Question: Problem statement/root cause: We are using AWS Glue to load data from a production Postgres DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is, however, failing because Spark only supports lowercase table column names, and unfortunately all our source Postgres table column names are in CamelCase and enclosed in double quotes. E.g.: our source table column name in the Postgres DB is "CreatedDate". The Spark job query is looking for createddate and is
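One workaround, sketched below, is to bypass the catalog column names and read through a JDBC subquery that quotes the CamelCase columns on the Postgres side and aliases them to lowercase. The connection details and column names are placeholders, not the asker's actual values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Quote the CamelCase columns in the subquery and alias them to lowercase.
query = ('(SELECT "CreatedDate" AS createddate, "CustomerId" AS customerid '
         'FROM my_schema.my_table) AS src')

df = (spark.read.format('jdbc')
      .option('url', 'jdbc:postgresql://my-host:5432/my_db')
      .option('dbtable', query)
      .option('user', 'my_user')
      .option('password', 'my_password')
      .option('driver', 'org.postgresql.Driver')
      .load())

df.printSchema()  # columns arrive already lowercased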

How to access data in subdirectories for partitioned Athena table

懵懂的女人 submitted on 2020-12-12 18:50:12
Question: I have an Athena table with a partition for each day, where the actual files are in "sub-directories" by hour, as follows: s3://my-bucket/data/2019/06/27/00/00001.json s3://my-bucket/data/2019/06/27/00/00002.json s3://my-bucket/data/2019/06/27/01/00001.json s3://my-bucket/data/2019/06/27/01/00002.json Athena is able to query this table without issue and find my data, but AWS Glue does not appear to be able to find this data. ALTER TABLE mytable ADD PARTITION (year=2019, month
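One workaround, sketched below under the assumption that reading outside the catalog table is acceptable, is to point a DynamicFrame directly at the day-level prefix with the S3 'recurse' option enabled so the hourly subdirectories are included:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://my-bucket/data/2019/06/27/'],
        'recurse': True,  # descend into the per-hour subdirectories
    },
    format='json',
)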

AWS Glue job consuming data from external REST API

大城市里の小女人 submitted on 2020-12-06 07:30:11
Question: I'm trying to create a workflow where an AWS Glue ETL job will pull JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help! Answer 1: Yes, I do extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. Usually I use Python Shell jobs for the extraction because they are faster (relatively small cold start). When the extraction is finished, it triggers a Spark-type job that reads only the JSON items I need. I use
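A hedged sketch of that pattern: a Python Shell job pulls JSON from the external API and stages it in S3 for the downstream Spark job to read. The endpoint, bucket and key are placeholders, and the requests library may need to be supplied to the Python Shell job as an extra library:

import json

import boto3
import requests

# Pull the payload from the external REST API (placeholder endpoint).
response = requests.get('https://api.example.com/v1/items', timeout=30)
response.raise_for_status()
items = response.json()

# Stage the raw JSON in S3; the Spark job reads only these staged objects.
s3 = boto3.client('s3')
s3.put_object(
    Bucket='my-staging-bucket',
    Key='rest-api/items/2020-12-06/items.json',
    Body=json.dumps(items).encode('utf-8'),
)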

DynamicFrame resolveChoice specs, date cast

丶灬走出姿态 submitted on 2020-08-26 13:44:51
Question: I am writing Glue code and using the DynamicFrame resolveChoice API with specs. I am trying to cast the source by passing casts when the dynamic frame is created from the catalog. I have successfully implemented the casting via resolveChoice specs, but while casting the date column I am getting null values; I just wanted to understand how we can pass a date with its source format in the cast. self.df_TR01=self.df_TR01.resolveChoice(specs=[('col1', 'cast:string'), ('col2_date', 'cast:date')]).toDF() But in col2_date I
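One workaround, sketched below, is to cast the date column to string in resolveChoice and then parse it with an explicit pattern in Spark; 'dd/MM/yyyy' is only an example of a source format, and df_TR01 stands in for the dynamic frame created from the catalog:

from pyspark.sql import functions as F

# Keep the raw string through resolveChoice, then apply the actual source format.
df = df_TR01.resolveChoice(
    specs=[('col1', 'cast:string'), ('col2_date', 'cast:string')]
).toDF()

df = df.withColumn('col2_date', F.to_date(F.col('col2_date'), 'dd/MM/yyyy'))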