Missing log lines when writing to cloudwatch from ECS Docker containers

Backend · unresolved · 4 answers · 1069 views

暗喜 · 2021-02-08 10:12

Why are some streams of a CloudWatch Logs Group incomplete (i.e., the Fargate Docker container on AWS ECS exits before all the logs are printed to CloudWatch Logs)?

4 Answers
  •  遇见更好的自我
    2021-02-08 10:21

    UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below.


    I've seen the same behaviour when using ECS Fargate containers to run Python scripts, and ran into the same frustration!

    I think it's due to CloudWatch Logs Agent publishing log events in batches:

    How are log events batched?

    A batch becomes full and is published when any of the following conditions are met:

    1. The buffer_duration amount of time has passed since the first log event was added.

    2. Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.

    3. The number of log events has reached batch_count.

    4. Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.

    (Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)
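
    For intuition, those four rules can be sketched as a small Python check. This is a toy model, not the agent's actual implementation; the batch_size and batch_count values below are placeholders (only the 5-second buffer_duration default is documented here).

    from dataclasses import dataclass
    import time

    @dataclass
    class Batch:
        started_at: float        # time.time() when the first event was added
        first_timestamp_ms: int  # timestamp of the first event in the batch
        total_bytes: int = 0
        count: int = 0

    ONE_DAY_MS = 24 * 60 * 60 * 1000

    def should_publish(batch, event_bytes, event_timestamp_ms,
                       buffer_duration_ms=5000,  # documented default
                       batch_size=1024 * 1024,   # placeholder value
                       batch_count=10000):       # placeholder value
        age_ms = (time.time() - batch.started_at) * 1000
        return (age_ms >= buffer_duration_ms                                    # rule 1
                or batch.total_bytes + event_bytes > batch_size                 # rule 2
                or batch.count >= batch_count                                   # rule 3
                or event_timestamp_ms - batch.first_timestamp_ms > ONE_DAY_MS)  # rule 4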

    So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)

    There doesn't seem to be a direct way to force the agent to publish, but the batching rules above suggest that if the container waits at least buffer_duration seconds (5 seconds by default) before exiting, any previously buffered logs should be published.

    With a bit of testing that I'll describe below, here's the workaround I landed on: a shell script, run_then_wait.sh, wraps the command that runs the Python script and sleeps after it completes.

    Dockerfile

    FROM python:3.7-alpine
    COPY run_then_wait.sh .
    COPY main.py .
    
    # The original command
    # ENTRYPOINT ["python", "main.py"]
    
    # To run the original command and then wait
    ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]
    

    run_then_wait.sh

    #!/bin/sh
    set -e
    
    # Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
    trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT
    
    # Run the given command
    "$@"
    

    main.py

    import logging
    import time
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()
    
    if __name__ == "__main__":
        # After testing some random values, had most luck to induce the
        # issue by sleeping 9 seconds here; would occur ~30% of the time
        time.sleep(9)
        logger.info("Hello world")
    

    Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.
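
    For that in-script variant, here's a minimal sketch using atexit so the sleep runs however the script exits, plus a SIGTERM handler, since ECS sends SIGTERM when stopping a task and atexit hooks don't fire on unhandled signals. The 10-second wait mirrors run_then_wait.sh; treat the structure as an assumption to adapt, not a drop-in replacement.

    import atexit
    import logging
    import signal
    import sys
    import time

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()

    def wait_for_log_flush():
        # Same idea as run_then_wait.sh: twice the default buffer_duration
        logger.info("Waiting for logs to flush to CloudWatch Logs...")
        time.sleep(10)

    # Runs on normal exit and after unhandled exceptions
    atexit.register(wait_for_log_flush)

    # Convert SIGTERM (sent by ECS on task stop) into a normal exit so
    # the atexit hook still runs; this won't help against SIGKILL.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))

    if __name__ == "__main__":
        logger.info("Hello world")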

    It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective: 30 runs with the original command and 30 with run_then_wait.sh. The issue was observed in 30% of the original runs versus 0% of the wrapped runs. Hope this is similarly effective for you!
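
    If you want to reproduce that check yourself, a boto3 sketch like the one below counts how many recent streams end with the expected final line. The log group name and marker string are hypothetical placeholders; substitute your own, and make sure your AWS credentials and region are configured.

    import boto3

    LOG_GROUP = "/ecs/my-task"                    # hypothetical -- use your group
    EXPECTED_LAST_LINE = "INFO:root:Hello world"  # final line main.py should emit

    logs = boto3.client("logs")

    # Most recently active streams in the group
    streams = logs.describe_log_streams(
        logGroupName=LOG_GROUP,
        orderBy="LastEventTime",
        descending=True,
        limit=30,
    )["logStreams"]

    complete = 0
    for stream in streams:
        # startFromHead=False reads from the tail; limit=1 gives the last event
        events = logs.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=stream["logStreamName"],
            startFromHead=False,
            limit=1,
        )["events"]
        if events and EXPECTED_LAST_LINE in events[0]["message"]:
            complete += 1

    print(f"{complete}/{len(streams)} streams contain the final log line")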
