Question
I'm trying to create a workflow where an AWS Glue ETL job pulls JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help!
Answer 1:
Yes, I extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. I usually use Python Shell jobs for the extraction because they are faster (relatively small cold start). When the extraction finishes, it triggers a Spark-type job that reads only the JSON items I need. I use the requests Python library.
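For the extraction itself, a minimal sketch with the requests library could look like this (the endpoint, auth token, and pagination scheme below are hypothetical placeholders, not from the original answer):

import requests

API_URL = "https://api.example.com/v1/tweets"     # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}     # hypothetical auth token

tweets = []
params = {"page": 1, "per_page": 100}
while True:
    resp = requests.get(API_URL, headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    batch = resp.json()
    if not batch:                 # stop when the API returns an empty page
        break
    tweets.extend(batch)
    params["page"] += 1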
To save the data into S3 you can then do something like this:
import boto3
import json

# Initialize the S3 resource
s3 = boto3.resource('s3')

tweets = []
# ... code that extracts tweets from the API ...

# Serialize the tweets and write them to the bucket as a single JSON object
tweets_json = json.dumps(tweets)
obj = s3.Object("my-tweets", "tweets.json")
obj.put(Body=tweets_json)
Answer 2:
The AWS Glue Python Shell executor has a limit of 1 DPU max. If that's an issue, as it was in my case, a solution could be running the script as an ECS task.
You can run about 150 requests/second using libraries like asyncio and aiohttp in Python (example 1, example 2).
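As a rough illustration of that pattern (the URLs and concurrency limit here are made up, not taken from the linked examples):

import asyncio
import aiohttp

# Hypothetical list of paginated API URLs to fetch
URLS = [f"https://api.example.com/v1/items?page={i}" for i in range(1, 51)]

async def fetch(session, url, sem):
    async with sem:                               # cap in-flight requests
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()

async def main():
    sem = asyncio.Semaphore(20)                   # at most 20 concurrent requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u, sem) for u in URLS))

results = asyncio.run(main())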
Then you can distribute your requests across multiple ECS tasks or Kubernetes pods using Ray. Here you can find a few examples of what Ray can do for you.
This also allows you to cater for APIs with rate limiting.
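A minimal sketch of fanning the fetches out with Ray remote tasks might look like this (the endpoint and page range are hypothetical, and any throttling beyond Ray's own scheduling is left out):

import ray
import requests

ray.init()  # or ray.init(address="auto") when attaching to an existing cluster

@ray.remote
def fetch_page(page):
    # Each task pulls one page from a hypothetical paginated endpoint
    resp = requests.get("https://api.example.com/v1/items",
                        params={"page": page}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Distribute the pages across the cluster and collect the results
pages = ray.get([fetch_page.remote(p) for p in range(1, 101)])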
Once you've gathered all the data you need, run it through AWS Glue.
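For that last step, a Glue Spark job could read the JSON dropped in S3 with something like the following sketch (the bucket path is a placeholder for wherever the extraction step wrote its output):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the JSON files written by the extraction step into a DynamicFrame
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-tweets/"]},
    format="json",
)
dyf.printSchema()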
Source: https://stackoverflow.com/questions/59714187/aws-glue-job-consuming-data-from-external-rest-api