I would like to grab a file straight off the Internet and stick it into an S3 bucket to then copy it over to a PIG cluster. Due to the size of the file and my not so good int
[2017 edit] I gave the original answer back in 2013. Today I'd recommend using AWS Lambda to download the file and put it on S3. That achieves the desired effect: placing an object on S3 with no server involved.
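To illustrate the Lambda route, here is a minimal sketch of what such a function might look like. The event keys (url, bucket, key) and the helper name stream_url_to_s3 are my own assumptions for illustration, not anything Lambda prescribes, and the function's role needs permission to write to the bucket.

```python
import urllib.request


def stream_url_to_s3(url, bucket, key, s3_client=None, opener=urllib.request.urlopen):
    """Stream the body of `url` into s3://bucket/key without saving it locally."""
    if s3_client is None:
        # Imported lazily so the helper can be exercised without boto3 installed.
        import boto3
        s3_client = boto3.client("s3")
    # The HTTP response is a file-like object; upload_fileobj reads it in chunks.
    with opener(url) as response:
        s3_client.upload_fileobj(response, bucket, key)
    return {"bucket": bucket, "key": key}


def lambda_handler(event, context):
    # Hypothetical event shape: {"url": "...", "bucket": "...", "key": "..."}
    return stream_url_to_s3(event["url"], event["bucket"], event["key"])
```

Keep in mind that Lambda invocations have a maximum run time, so a very large or slow download may not finish there; in that case the EC2 route is probably the safer choice.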
[Original answer] It is not possible to do it directly.
Why not do this with an EC2 instance instead of your local PC? Upload speed from EC2 to S3 in the same region is very good.
Regarding streamed reading from and writing to S3, I use Python's smart_open.
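A rough sketch of how that might look, assuming smart_open is installed with S3 support and AWS credentials are already configured. The function names copy_stream and url_to_s3 are just illustrative, and the URL and bucket in the usage line are placeholders.

```python
import shutil


def copy_stream(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in fixed-size chunks so the whole file never sits in memory."""
    shutil.copyfileobj(src, dst, chunk_size)


def url_to_s3(url, s3_uri):
    """Read from an HTTP(S) URL and write to an s3:// URI using smart_open."""
    # Third-party dependency, imported lazily: pip install "smart_open[s3]"
    from smart_open import open as sopen
    with sopen(url, "rb") as src, sopen(s3_uri, "wb") as dst:
        copy_stream(src, dst)
```

Usage would be something like url_to_s3("https://download-link-address/", "s3://aws-bucket/data-file").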
For anyone (like me) less experienced, here is a more detailed description of the process via EC2:
Launch an Amazon EC2 instance in the same region as the target S3 bucket. The smallest available (default Amazon Linux) instance should be fine, but be sure to give it enough storage space to save your file(s). If you need transfer speeds above ~20 MB/s, consider selecting an instance type with more network bandwidth.
Open an SSH connection to the new EC2 instance, then download the file(s), for instance using wget. (For example, to download an entire directory via FTP, you might use wget -r ftp://name:passwd@ftp.com/somedir/.)
Using the AWS CLI (see Amazon's documentation), upload the file(s) to your S3 bucket. For example, aws s3 cp myfolder s3://mybucket/myfolder --recursive (for an entire directory). (Before this command will work, you need to add your S3 security credentials to a config file, as described in the Amazon documentation.)
Terminate/destroy your EC2 instance.
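If you would rather script the upload step in Python than call the AWS CLI, something along these lines should behave roughly like the recursive aws s3 cp above. This is only a sketch: iter_uploads and upload_dir are names I made up, and it assumes boto3 is installed on the instance with credentials configured.

```python
import os


def iter_uploads(local_dir, prefix):
    """Yield (local_path, s3_key) pairs for every file under local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Build the key from the path relative to local_dir, using "/" separators.
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            key = prefix.rstrip("/") + "/" + rel if prefix else rel
            yield path, key


def upload_dir(local_dir, bucket, prefix, s3_client=None):
    """Upload every file under local_dir to the bucket; return the number uploaded."""
    if s3_client is None:
        import boto3  # imported lazily; assumed to be installed on the instance
        s3_client = boto3.client("s3")
    count = 0
    for path, key in iter_uploads(local_dir, prefix):
        s3_client.upload_file(path, bucket, key)
        count += 1
    return count
```

For the example above, the call would be upload_dir("myfolder", "mybucket", "myfolder").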
You can stream the file from internet to AWS S3 using Python.
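import boto3
import urllib3

s3 = boto3.resource('s3')
http = urllib3.PoolManager()
# preload_content=False returns a file-like stream instead of reading the whole body into memory
response = http.request('GET', '<Internet_URL>', preload_content=False)  # provide the URL
# s3Bucket and key are the target bucket name and object key
s3.meta.client.upload_fileobj(response, s3Bucket, key,
    ExtraArgs={'ServerSideEncryption': 'aws:kms', 'SSEKMSKeyId': '<alias_name>'})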
Download the data via curl and pipe the contents straight to S3. The data is streamed directly to S3 and not stored locally, avoiding any memory issues.
curl "https://download-link-address/" | aws s3 cp - s3://aws-bucket/data-file
As suggested above, if the download speed is too slow on your local computer, launch an EC2 instance, ssh in, and execute the above command there.