Question
I'm trying to create a workflow where an AWS Glue ETL job pulls JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help!
Answer 1:
Yes, I extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. I usually use Python Shell jobs for the extraction because they are faster (relatively small cold start). When the extraction finishes, it triggers a Spark-type job that reads only the JSON items I need. I use the requests Python library.
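For the extraction itself, a minimal sketch with the requests library could look like this (the endpoint, auth token, and pagination scheme below are hypothetical placeholders, not from the original answer):

import requests

API_URL = "https://api.example.com/v1/tweets"     # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}     # hypothetical auth token

tweets = []
params = {"page": 1, "per_page": 100}
while True:
    resp = requests.get(API_URL, headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    batch = resp.json()
    if not batch:                 # stop when the API returns an empty page
        break
    tweets.extend(batch)
    params["page"] += 1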
To save the data into S3 you can then do something like this:
import boto3
import json

# Initialize the S3 resource
s3 = boto3.resource('s3')

tweets = []
# ... code that extracts tweets from the API ...

# Serialize the tweets and write them to the bucket as a single JSON object
tweets_json = json.dumps(tweets)
obj = s3.Object("my-tweets", "tweets.json")
obj.put(Body=tweets_json)
Answer 2:
The AWS Glue Python Shell executor has a limit of 1 DPU max. If that's an issue, as it was in my case, a solution could be running the script as an ECS task.
You can run about 150 requests/second using libraries like asyncio and aiohttp in Python (example 1, example 2).
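As a rough illustration of that pattern (the URLs and concurrency limit here are made up, not taken from the linked examples):

import asyncio
import aiohttp

# Hypothetical list of paginated API URLs to fetch
URLS = [f"https://api.example.com/v1/items?page={i}" for i in range(1, 51)]

async def fetch(session, url, sem):
    async with sem:                               # cap in-flight requests
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()

async def main():
    sem = asyncio.Semaphore(20)                   # at most 20 concurrent requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u, sem) for u in URLS))

results = asyncio.run(main())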
Then you can distribute your requests across multiple ECS tasks or Kubernetes pods using Ray. Here you can find a few examples of what Ray can do for you.
This also allows you to cater for APIs with rate limiting.
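A minimal sketch of fanning the fetches out with Ray remote tasks might look like this (the endpoint and page range are hypothetical, and any throttling beyond Ray's own scheduling is left out):

import ray
import requests

ray.init()  # or ray.init(address="auto") when attaching to an existing cluster

@ray.remote
def fetch_page(page):
    # Each task pulls one page from a hypothetical paginated endpoint
    resp = requests.get("https://api.example.com/v1/items",
                        params={"page": page}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Distribute the pages across the cluster and collect the results
pages = ray.get([fetch_page.remote(p) for p in range(1, 101)])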
Once you've gathered all the data you need, run it through AWS Glue.
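For that last step, a Glue Spark job could read the JSON dropped in S3 with something like the following sketch (the bucket path is a placeholder for wherever the extraction step wrote its output):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the JSON files written by the extraction step into a DynamicFrame
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-tweets/"]},
    format="json",
)
dyf.printSchema()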
Source: https://stackoverflow.com/questions/59714187/aws-glue-job-consuming-data-from-external-rest-api