Fastest way to sync two Amazon S3 buckets

滥情空心 2020-12-15 05:13

I have an S3 bucket with around 4 million files taking up some 500 GB in total. I need to sync the files to a new bucket (actually, changing the name of the bucket would suffice,

5 Answers
  • 2020-12-15 05:42

    As a variant of what the OP is already doing:
    one could create a list of all the objects to be synced with aws s3 sync --dryrun

    aws s3 sync s3://source-bucket s3://destination-bucket --dryrun
    # or even
    aws s3 ls s3://source-bucket --recursive
    

    Using that list of objects to be synced, split the job into multiple aws s3 cp ... commands (see the sketch at the end of this answer). This way the AWS CLI won't just hang while building a list of sync candidates, as it does when you start multiple sync jobs with --exclude "*" --include "1?/*" style arguments.

    When all the copy jobs are done, another sync is worth running for good measure, perhaps with --delete, if objects might have been deleted from the source bucket.

    If the source and destination buckets are in different regions, one could enable cross-region bucket replication before starting to sync the buckets.
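
    A minimal shell sketch of that split-and-copy approach. The bucket names are placeholders, and it assumes keys contain no spaces and the default dry-run output format ("(dryrun) copy: s3://source/KEY to s3://dest/KEY"):

    # Build the list of source objects from the dry run
    aws s3 sync s3://source-bucket s3://destination-bucket --dryrun \
        | awk '$2 == "copy:" {print $3}' > objects-to-sync.txt

    # Split the list into chunks and run one background copy loop per chunk
    split -l 500000 objects-to-sync.txt chunk-
    for f in chunk-*; do
        while read -r src; do
            aws s3 cp "$src" "${src/source-bucket/destination-bucket}"
        done < "$f" &
    done
    wait

    Each aws s3 cp call here copies a single object, so the speedup comes from running the chunks concurrently rather than from any per-call gain.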

  • 2020-12-15 05:44

    40,100 objects (160 GB) were copied/synced in less than 90 seconds.

    Follow these steps:

    Step 1 - select the source bucket
    Step 2 - under the source bucket's Properties, choose the advanced settings
    Step 3 - enable Transfer Acceleration and note the accelerate endpoint
    (a CLI equivalent of steps 2 and 3 is sketched at the end of this answer)
    

    AWS CLI configuration, one time only (no need to repeat this on every run):

    aws configure set default.region us-east-1 #set it to your default region
    aws configure set default.s3.max_concurrent_requests 2000
    aws configure set default.s3.use_accelerate_endpoint true
    

    Options:

    --delete : delete files in the destination that are not present in the source

    The AWS CLI command to sync:

    aws s3 sync s3://source-test-1992/foldertobesynced/ s3://destination-test-1992/foldertobesynced/ --delete --endpoint-url http://source-test-1992.s3-accelerate.amazonaws.com
    

    Transfer Acceleration pricing:

    https://aws.amazon.com/s3/pricing/#S3_Transfer_Acceleration_pricing

    The pricing page does not mention what the cost is when both buckets are in the same region.
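
    If you prefer not to use the console, steps 2 and 3 above can also be done from the CLI. A sketch, with the bucket name as a placeholder:

    # Enable Transfer Acceleration on the source bucket
    aws s3api put-bucket-accelerate-configuration \
        --bucket source-test-1992 \
        --accelerate-configuration Status=Enabled

    # Confirm it is enabled; the accelerate endpoint is then
    # <bucket-name>.s3-accelerate.amazonaws.com
    aws s3api get-bucket-accelerate-configuration --bucket source-test-1992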

  • 2020-12-15 05:45

    You can use EMR and S3DistCp. I had to sync 153 TB between two buckets and it took about 9 days. Also make sure the buckets are in the same region, otherwise you also get hit with data transfer costs. The add-steps command below assumes you already have a running EMR cluster; a sketch for creating one follows the documentation links.

    aws emr add-steps --cluster-id <value> --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=command-runner.jar,Args=["s3-dist-cp","--s3Endpoint=s3.amazonaws.com","--src=s3://BUCKETNAME","--dest=s3://BUCKETNAME"]
    

    http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

    http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html
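
    If you don't already have a cluster to add that step to, here is a minimal sketch for spinning one up; the cluster name, release label, instance type, and instance count are placeholder choices, not recommendations:

    # Assumes the default EMR roles already exist (aws emr create-default-roles)
    aws emr create-cluster \
        --name "s3distcp-copy" \
        --release-label emr-6.3.0 \
        --applications Name=Hadoop \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles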

  • 2020-12-15 05:46

    New option in 2020:

    We had to move about 500 terabytes (10 million files) of client data between S3 buckets. Since we only had a month to finish the whole project, and aws sync tops out at about 120 MB/s... we knew right away this was going to be trouble.

    I found this Stack Overflow thread first, but when I tried most of the options here, they just weren't fast enough. The main problem is that they all rely on serial object listing. To solve this, I figured out a way to parallelize listing any bucket without any a priori knowledge of its key naming. Yes, it can be done!

    The open source tool is called S3P.

    With S3P we were able to sustain copy speeds of 8 gigabytes/second and listing speeds of 20,000 items/second using a single EC2 instance. (It's a bit faster to run S3P on EC2 in the same region as the buckets, but S3P is almost as fast running on a local machine.)

    More info:

    • Blog post on S3P
    • S3P on NPM

    Or just try it out:

    # Run in any shell to get command-line help. No installation needed:
    
    npx s3p
    

    (Requires Node.js, the AWS CLI, and valid AWS CLI credentials.)
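
    A hedged example of an actual bucket-to-bucket copy. The flag names below are assumptions based on s3p's documentation, so confirm them with npx s3p cp --help first; the bucket names are placeholders:

    # Copy everything from one bucket to another (check flags via `npx s3p cp --help`)
    npx s3p cp --bucket source-bucket --to-bucket destination-bucket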

  • 2020-12-15 05:59

    Background: The bottlenecks in the sync command are listing objects and copying objects. Listing objects is normally a serial operation, although if you specify a prefix you can list a subset of objects; this is the only trick to parallelizing it. Copying objects can be done in parallel.

    Unfortunately, aws s3 sync doesn't do any parallelizing, and it doesn't even support listing by prefix unless the prefix ends in / (i.e., it can only list by folder). This is why it's so slow.

    s3s3mirror (and many similar tools) parallelizes the copying. I don't think it (or any other tool) parallelizes the listing of objects, because that requires a priori knowledge of how the objects are named. However, it does support prefixes, and you can invoke it multiple times, once for each letter of the alphabet (or whatever split is appropriate); see the sketch at the end of this answer.

    You can also roll your own using the AWS API.

    Lastly, the aws s3 sync command itself (and any other tool, for that matter) should be a bit faster if you run it on an EC2 instance in the same region as your S3 buckets.
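
    A minimal shell sketch of that per-prefix idea using only the stock CLI: since sync can list by folder, run one sync per top-level folder in parallel. The bucket names are placeholders, and it assumes folder names contain no spaces and that objects live under top-level folders rather than at the bucket root:

    # One background sync per top-level "folder"; each sync lists only its own prefix
    for prefix in $(aws s3 ls s3://source-bucket/ | awk '$1 == "PRE" {print $2}'); do
        aws s3 sync "s3://source-bucket/${prefix}" "s3://destination-bucket/${prefix}" &
    done
    wait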
