aws-glue

How can I use an external Python library in AWS Glue?

Submitted by 与世无争的帅哥 on 2021-02-07 03:59:32
Question: First Stack Overflow question here; hope I do this correctly. I need to use an external Python library, openpyxl, in AWS Glue. I followed these directions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html However, after saving my zip file to the correct S3 location and pointing my Glue job at it, I'm not sure what to actually write in the script. I tried the typical import openpyxl , but that just returns the…
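
A minimal sketch of what the job script itself might contain once the zipped library is referenced under the job's Python library path (the workbook contents and output path below are made up for illustration):

```python
# Hypothetical Glue job script; assumes openpyxl.zip is listed in the job's
# "Python library path" so Glue adds it to sys.path before this code runs.
import sys

from awsglue.utils import getResolvedOptions

import openpyxl  # lowercase module name, unlike the "Openpyxl" package title

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Build a tiny workbook just to prove the import worked.
wb = openpyxl.Workbook()
ws = wb.active
ws["A1"] = "hello from " + args["JOB_NAME"]
wb.save("/tmp/demo.xlsx")
```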

Querying optional nested JSON fields in Athena

Submitted by 混江龙づ霸主 on 2021-02-05 09:25:10
Question: I have JSON data that looks something like: { "col1" : 123, "metadata" : { "opt1" : 456, "opt2" : 789 } } where the various metadata fields (of which there are many) are optional and may or may not be present. My query is: select col1, metadata.opt1 from "db-name".tablename If opt1 is not present in any rows, I would expect this to return all rows with a blank for the opt1 column, but if there wasn't a row with opt1 in metadata when the crawler ran (and it might still not be present in the data…
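
One hedged workaround sketch, assuming the metadata column is declared as a plain string in the table definition so that Presto's json_extract_scalar (which returns NULL for missing keys) can be used; database, table, and output location are placeholders:

```python
# Sketch: run the Athena query via boto3; json_extract_scalar tolerates a missing
# "opt1" key and simply yields NULL for that row.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT col1,
       json_extract_scalar(metadata, '$.opt1') AS opt1
FROM "db-name".tablename
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "db-name"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```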

How to import referenced files in ETL scripts?

Submitted by 被刻印的时光 ゝ on 2021-02-05 07:11:32
Question: I have a script into which I'd like to pass a configuration file. On the Glue jobs page, I see that there is a "Referenced files path" which points to my configuration file. How do I then use that file within my ETL script? I've tried from configuration import * , where the referenced file is named configuration.py , but no luck (ImportError: No module named configuration). Answer 1: I noticed the same issue. I believe there is already a ticket to address it, but here is what AWS support suggests…
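
A rough sketch of the usual pattern, assuming the referenced file is plain data (for example JSON) rather than an importable module: Glue copies referenced files into the job's working directory, so they can be opened by bare filename. The file name and key below are invented for illustration:

```python
# Sketch, assuming "Referenced files path" points at s3://.../configuration.json.
# Referenced files land in the job's working directory at run time.
import json

with open("configuration.json") as f:
    config = json.load(f)

# "source_table" is a hypothetical key used only to show the lookup.
print(config.get("source_table"))
```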

Connection timeout when reading Netezza from AWS Glue

Submitted by 倾然丶 夕夏残阳落幕 on 2021-01-29 19:11:01
Question: I am trying to use AWS Glue to pull data from my on-premises Netezza database into S3. The code I have written so far (not complete):

df = glueContext.read.format("jdbc")\
    .option("driver", "org.netezza.Driver")\
    .option("url", "jdbc:netezza://NetezzaHost01:5480/Netezza_DB")\
    .option("dbtable", "ADMIN.table1")\
    .option("user", "myUser")\
    .option("password", "myPassword")\
    .load()
print(df.count())

I am using a custom JDBC driver jar since AWS Glue does not support Netezza natively (the…
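
Since the read above fails with a connection timeout, a quick probe like the following sketch can help separate networking problems (the VPC, subnet, and security group taken from the attached Glue connection must be able to reach the on-premises host) from driver problems; the host and port are taken from the JDBC URL in the question:

```python
# Sketch: if this plain TCP connection also times out, the issue is routing from
# the Glue job's network configuration to Netezza, not the JDBC driver itself.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)
try:
    sock.connect(("NetezzaHost01", 5480))
    print("TCP connection to Netezza succeeded")
except OSError as exc:
    print("TCP connection failed:", exc)
finally:
    sock.close()
```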

How to write an ETL job to transfer a MySQL database table to another MySQL RDS database

Submitted by 孤者浪人 on 2021-01-29 16:23:42
Question: I am new to AWS. I want to write an ETL script using AWS Glue to transfer data from one MySQL database to another RDS MySQL database. Please suggest how to do this with AWS Glue. Thanks. Answer 1: You can use pymysql or mysql.connector as a separate zip file added to the Glue job. We have used pymysql for all our production jobs running in AWS Glue/Aurora RDS. Use these connectors to connect to both RDS MySQL instances. Read data from the RDS source db1 into a dataframe, perform the…
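
As a sketch of the overall flow (shown here with Spark's built-in JDBC reader/writer rather than pymysql; endpoints, tables, credentials, and the driver class all depend on your setup and are placeholders):

```python
# Sketch: copy a table from one MySQL instance to another inside a Glue job.
# Assumes a MySQL connector jar is available to the job.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

spark = GlueContext(SparkContext.getOrCreate()).spark_session

src = (
    spark.read.format("jdbc")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("url", "jdbc:mysql://source-db.example.com:3306/sourcedb")
    .option("dbtable", "customers")
    .option("user", "src_user")
    .option("password", "src_password")
    .load()
)

# ...apply any transformations here...

(
    src.write.format("jdbc")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("url", "jdbc:mysql://target-db.example.com:3306/targetdb")
    .option("dbtable", "customers")
    .option("user", "tgt_user")
    .option("password", "tgt_password")
    .mode("append")
    .save()
)
```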

Extract Embedded AWS Glue Connection Credentials Using Scala

Submitted by 我的未来我决定 on 2021-01-29 14:17:51
Question: I have a Glue job that reads directly from Redshift, and to do that, one has to provide connection credentials. I have created an embedded Glue connection and can extract the credentials with the following PySpark code. Is there a way to do this in Scala?

glue = boto3.client('glue', region_name='us-east-1')
response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)
table = spark.read.format(
    'com.databricks.spark.redshift'
).option(
    'url', 'jdbc:redshift://prod…

spark.sql.files.maxPartitionBytes not limiting max size of written partitions

Submitted by ≯℡__Kan透↙ on 2021-01-29 10:07:27
Question: I'm trying to copy parquet data from another S3 bucket to my S3 bucket. I want to limit the size of each partition to a maximum of 128 MB. I thought spark.sql.files.maxPartitionBytes defaulted to 128 MB, but when I look at the partition files in S3 after my copy, I see individual partition files of around 226 MB instead. I was looking at this post, which suggested setting this Spark config key in order to limit the maximum size of my partitions: Limiting maximum size of dataframe…
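
For context, spark.sql.files.maxPartitionBytes only controls how input files are split into read partitions; the size of written files follows the DataFrame's partitioning at write time. A hedged sketch of two common ways to bound output size (paths and numbers are placeholders):

```python
# Sketch: bound the size of written parquet files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://source-bucket/data/")

# Option 1: cap rows per output file (actual bytes still depend on row width).
df.write.option("maxRecordsPerFile", 1000000).parquet("s3://my-bucket/copy-a/")

# Option 2: choose a partition count so total size / partitions is roughly 128 MB.
df.repartition(64).write.parquet("s3://my-bucket/copy-b/")
```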

Can you have a permanent IP address with AWS Glue so that it can be whitelisted in Snowflake?

Submitted by 你离开我真会死。 on 2021-01-29 10:00:29
Question: The scenario is this: our Snowflake account will only be accessible from whitelisted IP addresses. If we plan to use AWS Glue, what IP address can we use so that it will allow us to connect to Snowflake? I need a way to identify this AWS Glue job by IP address (endpoint) so that it can be identified in Snowflake. I want to use AWS Glue because it is a serverless orchestration tool. Thanks, D. Answer 1: AWS publishes the IP ranges of several services and regions, but Glue is currently not listed…
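
A common workaround (sketched below with placeholder IDs) is to attach a NETWORK-type Glue connection that places the job in a private subnet whose route to the internet goes through a NAT gateway; the NAT gateway's Elastic IP is then the stable egress address to whitelist in Snowflake:

```python
# Sketch: create a NETWORK connection so the Glue job runs inside a chosen
# private subnet. The subnet is assumed to route outbound traffic through a
# NAT gateway that has an Elastic IP. All identifiers are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "glue-static-egress",
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```

Attaching this connection to the job does not change the Glue code itself; it only pins the job's network path so the outbound IP stays constant.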

How to relationalize JSON containing arrays

Submitted by 假如想象 on 2021-01-28 14:09:14
Question: I am using AWS Glue to read a data file containing JSON (on S3). This one is JSON with the data contained in an array. I have tried using the relationalize() function, but it doesn't work on arrays. It does work on nested JSON, but that is not the format of the input. Is there a way to relationalize JSON with arrays in it? Input data:

{
  "ID": "1234",
  "territory": "US",
  "imgList": [
    { "type": "box", "locale": "en-US", "url": "boxart/url.jpg" },
    { "type": "square", "locale": "en-US", "url": "square/url.jpg" }
  ]
}

Code:…
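
A hedged sketch of applying Relationalize to this input: the transform both flattens nested structs and pivots arrays such as imgList into child tables linked to the root by a generated id (the S3 paths are placeholders):

```python
# Sketch: read the JSON as a DynamicFrame and relationalize it; the array column
# ends up in its own frame (e.g. "root_imgList"), one row per array element.
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="json",
)

frames = Relationalize.apply(
    frame=dyf,
    staging_path="s3://my-bucket/tmp/",
    name="root",
    transformation_ctx="relationalize",
)

for name in frames.keys():
    print(name, frames.select(name).count())
```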

How to access run-property of AWS Glue workflow in Glue job?

Submitted by 风流意气都作罢 on 2021-01-28 11:15:02
Question: I have been working with AWS Glue workflows for orchestrating batch jobs. We need to pass a push-down predicate in order to limit the processing for the batch job. When we run Glue jobs alone, we can pass push-down predicates as command line arguments at run time (i.e. aws glue start-job-run --job-name foo.scala --arguments --arg1-text ${arg1}..). But when we use a Glue workflow to execute Glue jobs, it is a bit unclear. When we orchestrate batch jobs using AWS Glue workflows, we can add run properties…
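
The pattern usually suggested (sketched here with a made-up property key) is to read WORKFLOW_NAME and WORKFLOW_RUN_ID from the job arguments that the workflow passes in, then look up the workflow's run properties with boto3:

```python
# Sketch: jobs started by a Glue workflow receive WORKFLOW_NAME and WORKFLOW_RUN_ID
# as arguments; use them to fetch the run properties set on the workflow run.
import sys

import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["WORKFLOW_NAME", "WORKFLOW_RUN_ID"])

glue = boto3.client("glue")
run_props = glue.get_workflow_run_properties(
    Name=args["WORKFLOW_NAME"], RunId=args["WORKFLOW_RUN_ID"]
)["RunProperties"]

# "pushdown_predicate" is a hypothetical property name for this example.
pushdown = run_props.get("pushdown_predicate")
print("push-down predicate:", pushdown)
```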