Does aws-cli confirm checksums when uploading files to S3, or do I need to manage that myself?

Asked by 萌比男神i on 2021-02-20 11:20

If I'm uploading data to S3 using the aws-cli (i.e. using aws s3 cp), does the aws-cli do any work to confirm that the resulting file in S3 matches the original file, or do I need to manage that myself?

2 Answers
  • 2021-02-20 12:01

    The AWS support page How do I ensure data integrity of objects uploaded to or downloaded from Amazon S3? describes how to achieve this.

    First, determine the base64-encoded md5sum of the file you wish to upload:

    $ md5_sum_base64="$( openssl md5 -binary my-file | base64 )"
    

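    Note that md5sum prints a hex digest, while --content-md5 expects base64 of the raw 16 digest bytes. If you already have the hex form, it can be converted locally (a sketch using xxd; the digest below is just the well-known MD5 of the string "hello", used as an example value):

```shell
# Convert a hex MD5 digest to the base64 form S3 expects.
# Example value: MD5 of the string "hello".
hex_md5="5d41402abc4b2a76b9719d911017c592"
b64_md5="$(printf '%s' "$hex_md5" | xxd -r -p | base64)"
echo "$b64_md5"   # XUFAKrxLKna5cZ2REBfFkg==
```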
    Then use the s3api to upload the file:

    $ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64"
    

    Note the use of the --content-md5 flag, the help for this flag states:

    --content-md5  (string)  The  base64-encoded  128-bit MD5 digest of the part data.
    

    This does not say much about why to use this flag, but we can find this information in the API documentation for PutObject:

    To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.

    Using this flag causes S3 to verify, server-side, that the file hash matches the specified value. If the hashes match, S3 returns the ETag:

    {
        "ETag": "\"599393a2c526c680119d84155d90f1e5\""
    }
    

    The ETag value will usually be the hexadecimal md5sum (see this question for some scenarios where this may not be the case).
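    Because the ETag is usually the hex digest while your local value is base64, you can compare them by decoding the base64 value back to hex (a local sketch using the example values from above; no AWS access needed):

```shell
# Decode the base64 MD5 back to hex and compare it with the ETag.
etag_hex="599393a2c526c680119d84155d90f1e5"   # example ETag from above
md5_b64="WZOTosUmxoARnYQVXZDx5Q=="            # the matching base64 form
decoded_hex="$(printf '%s' "$md5_b64" | base64 -d | xxd -p | tr -d '\n')"
[ "$decoded_hex" = "$etag_hex" ] && echo "digests match"
```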

    If the hash does not match the one you specified, you get an error:

    A client error (InvalidDigest) occurred when calling the PutObject operation: The Content-MD5 you specified was invalid.
    

    In addition, you can add the file's md5sum to the object metadata as an extra check:

    $ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64" --metadata md5chksum="$md5_sum_base64"
    

    After uploading, you can issue the head-object command to check the values:

    $ aws s3api head-object --bucket my-bucket --key my-file
    {
        "AcceptRanges": "bytes",
        "ContentType": "binary/octet-stream",
        "LastModified": "Thu, 31 Mar 2016 16:37:18 GMT",
        "ContentLength": 605,
        "ETag": "\"599393a2c526c680119d84155d90f1e5\"",
        "Metadata": {    
            "md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="    
        }    
    }
    

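    The two returned values encode the same digest in different forms. Extracting and comparing them can be done with jq (a local sketch against a saved response, using the example values above; jq is assumed, as in the script below):

```shell
# Compare the ETag (hex) and md5chksum metadata (base64) from a saved
# head-object response; both should decode to the same digest.
response='{"ETag": "\"599393a2c526c680119d84155d90f1e5\"", "Metadata": {"md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="}}'
etag="$(printf '%s' "$response" | jq -r '.ETag' | tr -d '"')"
meta="$(printf '%s' "$response" | jq -r '.Metadata.md5chksum')"
meta_hex="$(printf '%s' "$meta" | base64 -d | xxd -p | tr -d '\n')"
[ "$meta_hex" = "$etag" ] && echo "match"
```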
    Here is a bash script that uses --content-md5, adds the metadata, and then verifies that the values returned by S3 match the local hashes:

    #!/bin/bash
    
    set -euf -o pipefail
    
    # assumes you have aws cli, jq installed
    
    # change these if required
    tmp_dir="$HOME/tmp"
    s3_dir="foo"
    s3_bucket="stack-overflow-example"
    aws_region="ap-southeast-2"
    aws_profile="my-profile"
    
    test_dir="$tmp_dir/s3-md5sum-test"
    file_name="MailHog_linux_amd64"
    test_file_url="https://github.com/mailhog/MailHog/releases/download/v1.0.0/MailHog_linux_amd64"
    s3_key="$s3_dir/$file_name"
    return_dir="$( pwd )"
    
    cd "$tmp_dir" || exit
    mkdir "$test_dir"
    cd "$test_dir" || exit
    
    wget "$test_file_url"
    
    md5_sum_hex="$( md5sum "$file_name" | awk '{ print $1 }' )"
    md5_sum_base64="$( openssl md5 -binary "$file_name" | base64 )"
    
    echo "$file_name hex    = $md5_sum_hex"
    echo "$file_name base64 = $md5_sum_base64"
    
    echo "Uploading $file_name to s3://$s3_bucket/$s3_dir/$file_name"
    aws \
    --profile "$aws_profile" \
    --region "$aws_region" \
    s3api put-object \
    --bucket "$s3_bucket" \
    --key "$s3_key" \
    --body "$file_name" \
    --metadata md5chksum="$md5_sum_base64" \
    --content-md5 "$md5_sum_base64"
    
    echo "Verifying sums match"
    
    s3_md5_sum_hex=$( aws --profile "$aws_profile" --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.ETag' | sed 's/"//g' )
    s3_md5_sum_base64=$( aws --profile "$aws_profile"  --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.Metadata.md5chksum' )
    
    if [ "$md5_sum_hex" == "$s3_md5_sum_hex" ] && [ "$md5_sum_base64" == "$s3_md5_sum_base64" ]; then
        echo "checksums match"
    else
        echo "something is wrong, checksums do not match:"
    
        cat <<EOM | column -t -s ' '
    $file_name file hex:    $md5_sum_hex    s3 hex:    $s3_md5_sum_hex
    $file_name file base64: $md5_sum_base64 s3 base64: $s3_md5_sum_base64
    EOM
    
    fi
    
    echo "Cleaning up"
    cd "$return_dir"
    rm -rf "$test_dir"
    aws \
    --profile "$aws_profile" \
    --region "$aws_region" \
    s3api delete-object \
    --bucket "$s3_bucket" \
    --key "$s3_key"
    
  • 2021-02-20 12:08

    According to the FAQ in the aws-cli GitHub repository, checksums are verified in most cases during both upload and download.

    Key points for uploads:

    • The AWS CLI calculates the Content-MD5 header for both standard and multipart uploads.
    • If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and will instead return an error message back to the AWS CLI.
    • The AWS CLI will retry this error up to 5 times before giving up and exiting with a nonzero exit code.