aws-glue

Input data to AWS Elastic Search using Glue

被刻印的时光 ゝ submitted on 2021-01-28 09:03:01
Question: I'm looking for a solution to insert data into AWS Elasticsearch using AWS Glue (Python or PySpark). I have seen the Boto3 SDK for Elasticsearch but could not find any function to insert data into Elasticsearch. Can anyone help me find a solution? Any useful links or code? Answer 1: For AWS Glue you need to add an additional jar to the job. Download the jar from https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/7.8.0/elasticsearch-hadoop-7.8.0.jar. Save the jar on S3 and pass it
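A minimal sketch of what the write could look like once the elasticsearch-hadoop jar is attached to the job (for example via the --extra-jars job parameter). The endpoint, port, and index names below are placeholders, not values from the question:

# Hedged sketch: write a Spark DataFrame to Elasticsearch from a Glue job,
# assuming elasticsearch-hadoop-7.8.0.jar has been added to the job's classpath.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Placeholder data; in a real job this would come from the Data Catalog or S3.
df = spark.createDataFrame([("doc-1", "hello"), ("doc-2", "world")], ["id", "message"])

# The connector exposes the "org.elasticsearch.spark.sql" data source.
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "my-es-domain.us-east-1.es.amazonaws.com")  # placeholder endpoint
   .option("es.port", "443")
   .option("es.nodes.wan.only", "true")   # typically needed for hosted/VPC endpoints
   .option("es.resource", "my-index/_doc")  # placeholder index/type
   .mode("append")
   .save())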

aws glue to access/crawl dynamodb from another aws account (cross account access)

独自空忆成欢 submitted on 2021-01-28 06:56:15
Question: I have written a Glue job which exports a DynamoDB table and stores it on S3 in CSV format. The Glue job and the table are in the same AWS account, but the S3 bucket is in a different AWS account. I have been able to access the cross-account S3 bucket from the Glue job by attaching the following bucket policy to it. { "Version": "2012-10-17", "Statement": [ { "Sid": "tempS3Access", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<ROLE-PATH>" }, "Action": [ "s3:Get*",
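A hedged sketch of how a bucket policy like the one in the question could be applied from the bucket-owning account with boto3. The bucket name is a placeholder, and the Action list is filled in as a plausible minimal set because the original excerpt is truncated after "s3:Get*":

# Assumption: run with credentials from the account that owns the bucket.
import json
import boto3

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "tempS3Access",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<ROLE-PATH>"},
            # "s3:Get*" comes from the question; the rest are illustrative additions.
            "Action": ["s3:Get*", "s3:Put*", "s3:List*"],
            "Resource": [
                "arn:aws:s3:::my-cross-account-bucket",
                "arn:aws:s3:::my-cross-account-bucket/*",
            ],
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(
    Bucket="my-cross-account-bucket",          # placeholder bucket name
    Policy=json.dumps(bucket_policy),
)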

How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

ⅰ亾dé卋堺 submitted on 2021-01-24 13:47:41
Question: I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" marker objects that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job? [S3 console screenshot]
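One possible workaround, rather than a Spark or GlueContext setting, is to delete the zero-byte "_$folder$" marker objects with boto3 after the write finishes. Bucket and prefix names below are placeholders:

# Hedged cleanup step appended to the end of the Glue job (or run as a separate step).
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-output-bucket")       # placeholder bucket name

# Remove every marker object the Hadoop S3 committer left behind under the output prefix.
for obj in bucket.objects.filter(Prefix="output/"):   # placeholder prefix
    if obj.key.endswith("_$folder$"):
        obj.delete()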

Filtering DynamicFrame with AWS Glue or PySpark

故事扮演 submitted on 2021-01-21 11:45:09
Question: I have a table in my AWS Glue Data Catalog called 'mytable'. This table is in an on-premises Oracle database connection 'mydb'. I'd like to filter the resulting DynamicFrame to only rows where the X_DATETIME_INSERT column (which is a timestamp) is greater than a certain time (in this case, '2018-05-07 04:00:00'). Afterwards, I'm trying to count the rows to ensure that the count is low (the table is about 40,000 rows, but only a few rows should meet the filter criteria). Here is my current
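A minimal sketch of one way to apply that filter with the Glue Filter transform and then count the surviving rows. The catalog database name is assumed, and the comparison assumes the column is read back as a timestamp:

from datetime import datetime
from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Database name "mydb" is assumed here; the question only names the connection.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="mytable"
)

cutoff = datetime(2018, 5, 7, 4, 0, 0)

# Keep only rows whose X_DATETIME_INSERT is after the cutoff.
filtered = Filter.apply(
    frame=dyf,
    f=lambda row: row["X_DATETIME_INSERT"] > cutoff,
)

print(filtered.count())   # expect a small number relative to ~40,000 source rows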

Run Crawler using custom resource Lambda

久未见 submitted on 2021-01-07 16:31:51
Question: I am trying to create and invoke an AWS Glue crawler using CloudFormation. The creation part of the crawler (DynamoDB as target) is in a Lambda function. How can I achieve all of this using CloudFormation? i.e. creation of the Lambda function from code present in S3; after the Lambda function is created, it should get triggered to create the crawler, and then the crawler should be invoked to create the targeted tables. I want all of this in CloudFormation. Link for reference: Is it possible to trigger a lambda on
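A hedged sketch of what the custom-resource Lambda handler could look like: on a CloudFormation Create event it creates a DynamoDB-targeted crawler, starts it, and signals back with cfnresponse. The crawler name, role ARN, database, and table name are placeholders, not values from the question:

import boto3
import cfnresponse  # helper module available to Lambda-backed CloudFormation custom resources

glue = boto3.client("glue")

def handler(event, context):
    try:
        if event["RequestType"] == "Create":
            glue.create_crawler(
                Name="my-dynamodb-crawler",                                   # placeholder
                Role="arn:aws:iam::<ACCOUNT-ID>:role/<GLUE-CRAWLER-ROLE>",    # placeholder
                DatabaseName="my_catalog_db",                                 # placeholder
                Targets={"DynamoDBTargets": [{"Path": "my-dynamodb-table"}]}, # placeholder
            )
            glue.start_crawler(Name="my-dynamodb-crawler")
        # Delete/Update events are acknowledged without action in this sketch.
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception as exc:
        cfnresponse.send(event, context, cfnresponse.FAILED, {"Error": str(exc)})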


How to use a CloudWatch custom log group with Python Shell Glue job?

自作多情 submitted on 2021-01-04 03:21:34
问题 I have some "Python Shell" type Glue jobs and I want to send the job logs to a custom CloudWatch log group instead of the default log group. I am able to achieve this for "Spark" type glue jobs by providing job parameters as below: "--enable-continuous-cloudwatch-log" = true "--continuous-log-logGroup" = "/aws-glue/jobs/glue-job-1" but the same parameters doesn't work for Python Shell jobs (logs still going to the default log groups /aws-glue/python-jobs/output and /aws-glue/python-jobs/error
