Question
In the documentation, I cannot find any way of checking the run status of a crawler. The only way I am doing it currently is by repeatedly checking AWS to see if the file/table has been created.
Is there a better way to block until the crawler finishes its run?
Answer 1:
You can use boto3 (or a similar SDK) to do this. It provides a get_crawler method; the information you need is in the "LastCrawl" section of the response.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_crawler
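For example, a minimal polling sketch along those lines, assuming a crawler named my-crawler and a 10-second polling interval (both placeholders, not from the original answer):

import time

import boto3

glue = boto3.client("glue")
crawler_name = "my-crawler"  # assumption: replace with your crawler's name

# Wait until the crawler is no longer running, then inspect its last run.
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(10)

last_crawl = glue.get_crawler(Name=crawler_name)["Crawler"].get("LastCrawl", {})
print(last_crawl.get("Status"))  # e.g. SUCCEEDED, FAILED, or CANCELLED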
Answer 2:
The following function uses boto3. It starts the AWS Glue crawler and waits until it completes, logging the crawler's state as it progresses. It was tested with Python 3.8 and boto3 1.17.3.
import logging
import time
import timeit

import boto3

log = logging.getLogger(__name__)


def glue_crawl(crawler: str, *, timeout_minutes: int = 60, retry_seconds: int = 5) -> None:
    """Crawl the specified AWS Glue crawler, waiting until completion."""
    # Ref: https://stackoverflow.com/a/66072347/
    timeout_seconds = timeout_minutes * 60
    client = boto3.client("glue")
    start_time = timeit.default_timer()
    abort_time = start_time + timeout_seconds

    def check_for_timeout() -> None:
        if timeit.default_timer() > abort_time:
            raise TimeoutError(f"Failed to crawl {crawler}. The allocated time of {timeout_minutes:,} minutes has elapsed.")

    def wait_until_ready() -> None:
        state_previous = None
        while True:
            response_get = client.get_crawler(Name=crawler)
            state = response_get["Crawler"]["State"]
            if state != state_previous:
                log.info(f"Crawler {crawler} is {state.lower()}.")
                state_previous = state
            if state == "READY":  # Other known states: RUNNING, STOPPING
                return
            check_for_timeout()
            time.sleep(retry_seconds)

    wait_until_ready()
    response_start = client.start_crawler(Name=crawler)
    assert response_start["ResponseMetadata"]["HTTPStatusCode"] == 200
    log.info(f"Crawling {crawler}.")
    wait_until_ready()
    log.info(f"Crawled {crawler}.")
Source: https://stackoverflow.com/questions/52996591/wait-until-aws-glue-crawler-has-finished-running