aws-glue

Input data to AWS Elastic Search using Glue

被刻印的时光 ゝ submitted on 2021-01-28 09:03:01
Question: I'm looking for a solution to insert data into AWS Elasticsearch using AWS Glue (Python or PySpark). I have seen the Boto3 SDK for Elasticsearch but could not find any function to insert data into Elasticsearch. Can anyone help me find a solution? Any useful links or code? Answer 1: For AWS Glue you need to add an additional jar to the job. Download the jar from https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/7.8.0/elasticsearch-hadoop-7.8.0.jar. Save the jar on S3 and pass it
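A minimal sketch of what the write could look like once the elasticsearch-hadoop jar is attached to the job (for example via the --extra-jars job parameter). The endpoint, port, and index names below are placeholders, not values from the question:

# Hedged sketch: write a Spark DataFrame to Elasticsearch from a Glue job,
# assuming elasticsearch-hadoop-7.8.0.jar has been added to the job's classpath.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Placeholder data; in a real job this would come from the Data Catalog or S3.
df = spark.createDataFrame([("doc-1", "hello"), ("doc-2", "world")], ["id", "message"])

# The connector exposes the "org.elasticsearch.spark.sql" data source.
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "my-es-domain.us-east-1.es.amazonaws.com")  # placeholder endpoint
   .option("es.port", "443")
   .option("es.nodes.wan.only", "true")   # typically needed for hosted/VPC endpoints
   .option("es.resource", "my-index/_doc")  # placeholder index/type
   .mode("append")
   .save())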

aws glue to access/crawl dynamodb from another aws account (cross account access)

独自空忆成欢 submitted on 2021-01-28 06:56:15
Question: I have written a Glue job which exports a DynamoDB table and stores it on S3 in CSV format. The Glue job and the table are in the same AWS account, but the S3 bucket is in a different AWS account. I have been able to access the cross-account S3 bucket from the Glue job by attaching the following bucket policy to it. { "Version": "2012-10-17", "Statement": [ { "Sid": "tempS3Access", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<ROLE-PATH>" }, "Action": [ "s3:Get*",
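A hedged sketch of how a bucket policy like the one in the question could be applied from the bucket-owning account with boto3. The bucket name is a placeholder, and the Action list is filled in as a plausible minimal set because the original excerpt is truncated after "s3:Get*":

# Assumption: run with credentials from the account that owns the bucket.
import json
import boto3

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "tempS3Access",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<AWS-ACCOUNT-ID>:role/<ROLE-PATH>"},
            # "s3:Get*" comes from the question; the rest are illustrative additions.
            "Action": ["s3:Get*", "s3:Put*", "s3:List*"],
            "Resource": [
                "arn:aws:s3:::my-cross-account-bucket",
                "arn:aws:s3:::my-cross-account-bucket/*",
            ],
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(
    Bucket="my-cross-account-bucket",          # placeholder bucket name
    Policy=json.dumps(bucket_policy),
)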

How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

ⅰ亾dé卋堺 submitted on 2021-01-24 13:47:41
Question: I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" marker objects that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job? [S3 console screenshot]
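One possible workaround, rather than a Spark or GlueContext setting, is to delete the zero-byte "_$folder$" marker objects with boto3 after the write finishes. Bucket and prefix names below are placeholders:

# Hedged cleanup step appended to the end of the Glue job (or run as a separate step).
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-output-bucket")       # placeholder bucket name

# Remove every marker object the Hadoop S3 committer left behind under the output prefix.
for obj in bucket.objects.filter(Prefix="output/"):   # placeholder prefix
    if obj.key.endswith("_$folder$"):
        obj.delete()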

Filtering DynamicFrame with AWS Glue or PySpark

故事扮演 submitted on 2021-01-21 11:45:09
Question: I have a table in my AWS Glue Data Catalog called 'mytable'. This table is in an on-premises Oracle database connection 'mydb'. I'd like to filter the resulting DynamicFrame to only rows where the X_DATETIME_INSERT column (which is a timestamp) is greater than a certain time (in this case, '2018-05-07 04:00:00'). Afterwards, I'm trying to count the rows to ensure that the count is low (the table is about 40,000 rows, but only a few rows should meet the filter criteria). Here is my current
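A minimal sketch of one way to apply that filter with the Glue Filter transform and then count the surviving rows. The catalog database name is assumed, and the comparison assumes the column is read back as a timestamp:

from datetime import datetime
from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Database name "mydb" is assumed here; the question only names the connection.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="mytable"
)

cutoff = datetime(2018, 5, 7, 4, 0, 0)

# Keep only rows whose X_DATETIME_INSERT is after the cutoff.
filtered = Filter.apply(
    frame=dyf,
    f=lambda row: row["X_DATETIME_INSERT"] > cutoff,
)

print(filtered.count())   # expect a small number relative to ~40,000 source rows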

Run Crawler using custom resource Lambda

久未见 submitted on 2021-01-07 16:31:51
Question: I am trying to create and invoke an AWS Glue crawler using CloudFormation. The creation part of the crawler (DynamoDB as target) is in a Lambda function. How can I achieve all of this using CloudFormation? i.e. creation of the Lambda function from code present in S3; after the Lambda function is created, it should get triggered to create the crawler, and then the crawler should be invoked to create the targeted tables. I want all of this in CloudFormation. Link for reference: Is it possible to trigger a lambda on
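A hedged sketch of what the custom-resource Lambda handler could look like: on a CloudFormation Create event it creates a DynamoDB-targeted crawler, starts it, and signals back with cfnresponse. The crawler name, role ARN, database, and table name are placeholders, not values from the question:

import boto3
import cfnresponse  # helper module available to Lambda-backed CloudFormation custom resources

glue = boto3.client("glue")

def handler(event, context):
    try:
        if event["RequestType"] == "Create":
            glue.create_crawler(
                Name="my-dynamodb-crawler",                                   # placeholder
                Role="arn:aws:iam::<ACCOUNT-ID>:role/<GLUE-CRAWLER-ROLE>",    # placeholder
                DatabaseName="my_catalog_db",                                 # placeholder
                Targets={"DynamoDBTargets": [{"Path": "my-dynamodb-table"}]}, # placeholder
            )
            glue.start_crawler(Name="my-dynamodb-crawler")
        # Delete/Update events are acknowledged without action in this sketch.
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception as exc:
        cfnresponse.send(event, context, cfnresponse.FAILED, {"Error": str(exc)})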


How to use a CloudWatch custom log group with Python Shell Glue job?

自作多情 submitted on 2021-01-04 03:21:34
问题 I have some "Python Shell" type Glue jobs and I want to send the job logs to a custom CloudWatch log group instead of the default log group. I am able to achieve this for "Spark" type glue jobs by providing job parameters as below: "--enable-continuous-cloudwatch-log" = true "--continuous-log-logGroup" = "/aws-glue/jobs/glue-job-1" but the same parameters doesn't work for Python Shell jobs (logs still going to the default log groups /aws-glue/python-jobs/output and /aws-glue/python-jobs/error
