Save DataFrame to CSV directly to S3 in Python


I have a pandas DataFrame that I want to upload to a new CSV file. The problem is that I don't want to save the file locally before transferring it to S3. Is there any method like to_csv for writing the DataFrame directly to S3?

10 Answers
  • 2020-11-28 02:21

    This is a more up-to-date answer:

    import s3fs
    
    s3 = s3fs.S3FileSystem(anon=False)
    
    # Use 'w' for py3, 'wb' for py2
    with s3.open('<bucket-name>/<filename>.csv','w') as f:
        df.to_csv(f)
    

    The problem with StringIO is that it eats into your memory: you hold both the pandas DataFrame and its full string copy in memory at once, which is very inefficient. With this method, you stream the file to S3 instead of converting it to a string first and then writing that string to S3.

    If you are working on an EC2 instance, you can give it an IAM role that allows writing to S3, so you don't need to pass in credentials directly. Alternatively, you can connect to a bucket by passing credentials to the S3FileSystem() constructor. See the documentation: https://s3fs.readthedocs.io/en/latest/
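
    A minimal sketch of that credential-passing variant (the key and secret values are placeholders, not real credentials):

    import s3fs
    
    # assumption: explicit credentials instead of an IAM role; values are placeholders
    s3 = s3fs.S3FileSystem(
        key='YOUR_AWS_ACCESS_KEY_ID',
        secret='YOUR_AWS_SECRET_ACCESS_KEY',
    )
    
    with s3.open('<bucket-name>/<filename>.csv', 'w') as f:
        df.to_csv(f)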

  • 2020-11-28 02:22

    I found a very simple solution that seems to work:

    import boto3
    
    s3 = boto3.client("s3")
    
    # note: this reads an existing local CSV from disk and uploads it in one call
    with open("filename.csv", "rb") as f:
        s3.put_object(
            Body=f.read(),
            Bucket="your-bucket",
            Key="your-key"
        )
    

    Hope that helps!

  • 2020-11-28 02:23

    You can use the S3 path directly. I am using pandas 0.24.1:

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame( [ [1, 1, 1], [2, 2, 2] ], columns=['a', 'b', 'c'])
    
    In [3]: df
    Out[3]:
       a  b  c
    0  1  1  1
    1  2  2  2
    
    In [4]: df.to_csv('s3://experimental/playground/temp_csv/dummy.csv', index=False)
    
    In [5]: pd.__version__
    Out[5]: '0.24.1'
    
    In [6]: new_df = pd.read_csv('s3://experimental/playground/temp_csv/dummy.csv')
    
    In [7]: new_df
    Out[7]:
       a  b  c
    0  1  1  1
    1  2  2  2
    
    

    Release Note:

    S3 File Handling

    pandas now uses s3fs for handling S3 connections. This shouldn't break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas (GH11915).
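
    If your credentials are not picked up from the environment, newer pandas releases (1.2 and later) also accept a storage_options dict that is forwarded to s3fs; a minimal sketch with placeholder credential values:

    import pandas as pd
    
    df = pd.DataFrame([[1, 1, 1], [2, 2, 2]], columns=['a', 'b', 'c'])
    
    # assumption: pandas >= 1.2 with s3fs installed; key/secret values are placeholders
    df.to_csv(
        's3://experimental/playground/temp_csv/dummy.csv',
        index=False,
        storage_options={'key': 'YOUR_AWS_ACCESS_KEY_ID', 'secret': 'YOUR_AWS_SECRET_ACCESS_KEY'},
    )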

  • 2020-11-28 02:28

    If you pass None as the first argument to to_csv(), the CSV data is returned as a string. From there it's an easy step to upload it to S3 in one go.

    It should also be possible to pass a StringIO object to to_csv(), but using a string is easier.
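
    For example, a minimal sketch using boto3 (the bucket and key names are placeholders):

    import boto3
    import pandas as pd
    
    df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
    
    # to_csv(None) returns the CSV contents as a str instead of writing a file
    csv_string = df.to_csv(None)
    
    s3 = boto3.client('s3')
    s3.put_object(Bucket='my-bucket', Key='path/to/file.csv', Body=csv_string)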

  • 2020-11-28 02:31

    I like s3fs, which lets you use S3 (almost) like a local filesystem.

    You can do this:

    import s3fs
    
    bytes_to_write = df.to_csv(None).encode()
    fs = s3fs.S3FileSystem(key=key, secret=secret)
    with fs.open('s3://bucket/path/to/file.csv', 'wb') as f:
        f.write(bytes_to_write)
    

    s3fs supports only the rb and wb modes for opening files, which is why I do this bytes_to_write encoding step.
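
    As the first answer above shows, newer s3fs releases also accept text mode, in which case the encoding step can be skipped; a sketch under that assumption:

    import s3fs
    
    # assumption: a newer s3fs release that supports text-mode writes; credentials as above
    fs = s3fs.S3FileSystem(key=key, secret=secret)
    with fs.open('s3://bucket/path/to/file.csv', 'w') as f:
        df.to_csv(f)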

  • 2020-11-28 02:40

    You can also use the AWS Data Wrangler:

    import awswrangler as wr
        
    wr.s3.to_csv(
        df=df,
        path="s3://...",
    )
    

    Note that it handles the multipart upload for you, which makes the upload faster.
