Question
In the documentation, I cannot find any way of checking the run status of a crawler. The only way I am doing it currently is by repeatedly checking AWS to see if the file/table has been created.
Is there a better way to block until the crawler finishes its run?
Answer 1:
You can use boto3 (or a similar SDK) to do this. It provides a get_crawler method; the information you need is in the "LastCrawl" section of the response.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_crawler
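For example, a minimal polling sketch along those lines, assuming a crawler named my-crawler and a 10-second polling interval (both placeholders, not from the original answer):

import time

import boto3

glue = boto3.client("glue")
crawler_name = "my-crawler"  # assumption: replace with your crawler's name

# Wait until the crawler is no longer running, then inspect its last run.
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(10)

last_crawl = glue.get_crawler(Name=crawler_name)["Crawler"].get("LastCrawl", {})
print(last_crawl.get("Status"))  # e.g. SUCCEEDED, FAILED, or CANCELLED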
Answer 2:
The following function uses boto3. It starts the AWS Glue crawler and waits until it completes, logging the crawler's state as it progresses. It was tested with Python 3.8 and boto3 1.17.3.
import logging
import time
import timeit

import boto3

log = logging.getLogger(__name__)


def glue_crawl(crawler: str, *, timeout_minutes: int = 60, retry_seconds: int = 5) -> None:
    """Crawl the specified AWS Glue crawler, waiting until completion."""
    # Ref: https://stackoverflow.com/a/66072347/
    timeout_seconds = timeout_minutes * 60
    client = boto3.client("glue")
    start_time = timeit.default_timer()
    abort_time = start_time + timeout_seconds

    def check_for_timeout() -> None:
        if timeit.default_timer() > abort_time:
            raise TimeoutError(f"Failed to crawl {crawler}. The allocated time of {timeout_minutes:,} minutes has elapsed.")

    def wait_until_ready() -> None:
        state_previous = None
        while True:
            response_get = client.get_crawler(Name=crawler)
            state = response_get["Crawler"]["State"]
            if state != state_previous:
                log.info(f"Crawler {crawler} is {state.lower()}.")
                state_previous = state
            if state == "READY":  # Other known states: RUNNING, STOPPING
                return
            check_for_timeout()
            time.sleep(retry_seconds)

    wait_until_ready()
    response_start = client.start_crawler(Name=crawler)
    assert response_start["ResponseMetadata"]["HTTPStatusCode"] == 200
    log.info(f"Crawling {crawler}.")
    wait_until_ready()
    log.info(f"Crawled {crawler}.")
Source: https://stackoverflow.com/questions/52996591/wait-until-aws-glue-crawler-has-finished-running