My AWS Cloudwatch bill is huge. How do I work out which log stream is causing it?

问题

I got a $1,200 invoice from Amazon for Cloudwatch services last month (specifically for 2 TB of log data ingestion in "AmazonCloudWatch PutLogEvents"), when I was expecting a few tens of dollars. I've logged into the Cloudwatch section of the AWS Console, and can see that one of my log groups used about 2TB of data, but there are thousands of different log streams in that log group, how can I tell which one used that amount of data?

回答1:

On the CloudWatch console, use the IncomingBytes metrics to find the amount of data ingested by each log group for a particular time period in uncompressed bytes using Metrics page. Follow the below steps -

Go to CloudWatch metrics page and click on AWS namespace 'Logs' --> 'Log Group Metrics'.
Select the IncomingBytes metrics of the required log groups and click on 'Graphed metrics' tab to see the graph.
Change the start time and end time such that their difference is 30 days and change the period to 30 days. This way, we will get only one data point. Also changed the graph to Number and statistics to Sum.

This way, you will see the amount of data ingested by each log groups and get an idea about which log group is ingesting how much.

You can also achieve the same result using AWS CLI. An example scenario where you just want to know the total amount of data ingested by log groups for say 30 days, you can use get-metric-statistics CLI command-

sample CLI command -

aws cloudwatch get-metric-statistics --metric-name IncomingBytes --start-time 2018-05-01T00:00:00Z --end-time 2018-05-30T23:59:59Z --period 2592000 --namespace AWS/Logs --statistics Sum --region us-east-1

sample output -

{
    "Datapoints": [
        {
            "Timestamp": "2018-05-01T00:00:00Z", 
            "Sum": 1686361672.0, 
            "Unit": "Bytes"
        }
    ], 
    "Label": "IncomingBytes"
}

To find the same for a particular log group, you can change the command to accommodate dimensions like -

aws cloudwatch get-metric-statistics --metric-name IncomingBytes --start-time 2018-05-01T00:00:00Z --end-time 2018-05-30T23:59:59Z --period 2592000 --namespace AWS/Logs --statistics Sum --region us-east-1 --dimensions Name=LogGroupName,Value=test1

One by one, you can run this command on all log groups and check which log group is responsible for most of the bill for data ingested and take corrective measures.

NOTE: Change the parameters specific to your environment and requirement.

The solution provided by OP gives data for the amount of logs stored which is different from logs ingested.

What is the difference?

Data ingested per month is not same as Data storage bytes. After the data is ingested to CloudWatch, it is archived by CloudWatch which includes 26 bytes of metadata per log event and is compressed using gzip level 6 compression. So the Storage bytes refers to the storage space used by Cloudwatch to store the logs after they're ingested.

Reference : https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/get-metric-statistics.html

回答2:

We had a lambda logging GB of data of due to accidental check-in. Here's a boto3 based python script based on the info from the answers above that scans all log groups and prints out any group with logs greater than 1GB in the past 7 days. This helped me more than trying to use AWS dashboard which was slow to update.

#!/usr/bin/env python3

# Outputs all loggroups with > 1GB of incomingBytes in the past 7 days

import boto3
from datetime import datetime as dt
from datetime import timedelta


logs_client = boto3.client('logs')
cloudwatch_client = boto3.client('cloudwatch')

end_date = dt.today().isoformat(timespec='seconds')
start_date = (dt.today() - timedelta(days=7)).isoformat(timespec='seconds')
print("looking from %s to %s" % (start_date, end_date))

paginator = logs_client.get_paginator('describe_log_groups')
pages = paginator.paginate()
for page in pages:
     for json_data in page['logGroups']:
        log_group_name = json_data.get("logGroupName") 

        cw_response = cloudwatch_client.get_metric_statistics(
           Namespace='AWS/Logs',    
           MetricName='IncomingBytes',
           Dimensions=[
            {
                'Name': 'LogGroupName',
                'Value': log_group_name
            },
            ],
            StartTime= start_date,
            EndTime=end_date,
            Period=3600 * 24 * 7,
            Statistics=[
                'Sum'
            ],
            Unit='Bytes'
        )
        if len(cw_response.get("Datapoints")):
            stats_data = cw_response.get("Datapoints")[0]
            stats_sum = stats_data.get("Sum")   
            sum_GB = stats_sum /  (1000 * 1000 * 1000)
            if sum_GB > 1.0:
                print("%s = %.2f GB" % (log_group_name , sum_GB))

回答3:

Okay, I'm answering my own question here, but here we go (with all other answers welcome):

You can use a combination of AWS CLI tools, the csvfix CSV package and a spreadsheet to work this out.

Log into the AWS Cloudwatch Console and grab the name of the log group which has generated all the data. In my case it's called "test01-ecs".
Unfortunately in the Cloudwatch Console you can't sort the streams by "Stored Bytes" (which would tell you which ones are biggest). If there are too many streams in the log group to look through in the Console then you need to dump them somehow. For this you can use the AWS CLI tool:

$ aws logs describe-log-streams --log-group-name test01-ecs
The command above will give you JSON output (assuming your AWS CLI tool is set to JSON output - set it to output = json in ~/.aws/config if not) and it will look something like this:

{ "logStreams": [ { "creationTime": 1479218045690, "arn": "arn:aws:logs:eu-west-1:902720333704:log-group:test01-ecs:log-stream:test-spec/test-spec/0307d251-7764-459e-a68c-da47c3d9ecd9", "logStreamName": "test-spec/test-spec/0308d251-7764-4d9f-b68d-da47c3e9ebd8", "storedBytes": 7032 } ] }
Pipe this output to a JSON file - in my case the file was 31 MB in size:

$ aws logs describe-log-streams --log-group-name test01-ecs >> ./cloudwatch-output.json
Use the in2csv package (part of csvfix) to convert the JSON file to a CSV file which can easily be imported into a spreadsheet, making sure you define the logStreams key to be used to import on:

$ in2csv cloudwatch-output.json --key logStreams >> ./cloudwatch-output.csv
Import the resulting CSV file into a spreadsheet (I use LibreOffice myself as it seems great at dealing with CSV) making sure the storedBytes field is imported as an integer.
Sort the storedBytes column in the spreadsheet to work out which log stream or streams are generating the most data.

In my case this worked - it turned out one of my log streams (with logs from a broken TCP pipe in a redis instance) was 4,000 times the size of all the other streams combined!

来源：https://stackoverflow.com/questions/43327714/my-aws-cloudwatch-bill-is-huge-how-do-i-work-out-which-log-stream-is-causing-it

标签

amazon-web-services

amazon-cloudwatch

amazon-cloudwatchlogs