Save DataFrame to CSV directly to S3 in Python

Asked 2020-11-28 02:02

I have a pandas DataFrame that I want to upload to a new CSV file. The problem is that I don't want to save the file locally before transferring it to S3. Is there any method for writing the DataFrame directly to S3?

10 Answers
  • 2020-11-28 02:21

    This is a more up-to-date answer:

    import s3fs
    
    s3 = s3fs.S3FileSystem(anon=False)
    
    # Use 'w' for py3, 'wb' for py2
    with s3.open('<bucket-name>/<filename>.csv','w') as f:
        df.to_csv(f)
    

    The problem with the StringIO approach is memory: it holds the pandas DataFrame and a full string copy of the CSV at the same time, which is very inefficient. With this method you stream the file to S3 instead of converting it to a string first and then writing that string to S3.

    If you are working on an EC2 instance, you can give it an IAM role that allows writing to S3, so you don't need to pass in credentials directly. However, you can also connect to a bucket by passing credentials to the S3FileSystem() constructor. See the documentation: https://s3fs.readthedocs.io/en/latest/
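
    A minimal sketch of that credentials variant (the key and secret values are placeholders):

    import s3fs

    # Explicit credentials; omit them to fall back to the standard AWS
    # credential chain (environment variables, ~/.aws/credentials, IAM role).
    s3 = s3fs.S3FileSystem(key="<access-key>", secret="<secret-key>")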

  • 2020-11-28 02:22

    I found a very simple solution that seems to work:

    import boto3

    s3 = boto3.client("s3")

    # Reads the file from local disk, then uploads it in one call
    s3.put_object(
        Body=open("filename.csv").read(),
        Bucket="your-bucket",
        Key="your-key"
    )
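
    Since the question asks to avoid the local file, the same call also works with the CSV produced in memory; a minimal variant (same placeholder bucket and key):

    # df.to_csv() with no path returns the CSV as a string, so nothing touches disk
    s3.put_object(
        Body=df.to_csv(index=False),
        Bucket="your-bucket",
        Key="your-key"
    )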
    
    

    Hope that helps!

  • 2020-11-28 02:23

    You can use the S3 path directly. I am using pandas 0.24.1:

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame( [ [1, 1, 1], [2, 2, 2] ], columns=['a', 'b', 'c'])
    
    In [3]: df
    Out[3]:
       a  b  c
    0  1  1  1
    1  2  2  2
    
    In [4]: df.to_csv('s3://experimental/playground/temp_csv/dummy.csv', index=False)
    
    In [5]: pd.__version__
    Out[5]: '0.24.1'
    
    In [6]: new_df = pd.read_csv('s3://experimental/playground/temp_csv/dummy.csv')
    
    In [7]: new_df
    Out[7]:
       a  b  c
    0  1  1  1
    1  2  2  2
    
    

    Release Note:

    S3 File Handling

    pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. GH11915.
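
    As a side note, the s3:// path only works once s3fs is installed, and newer pandas (1.2+) can also pass credentials explicitly through storage_options, which is forwarded to s3fs. A minimal sketch (bucket name and credentials are placeholders):

    import pandas as pd

    df = pd.DataFrame([[1, 1, 1], [2, 2, 2]], columns=['a', 'b', 'c'])

    # storage_options requires pandas >= 1.2; omit it to use the default
    # AWS credential chain instead
    df.to_csv(
        's3://my-bucket/temp_csv/dummy.csv',
        index=False,
        storage_options={'key': '<access-key>', 'secret': '<secret-key>'},
    )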

  • 2020-11-28 02:28

    If you pass None as the first argument to to_csv(), the data is returned as a string. From there it is an easy step to upload that to S3 in one go.

    It should also be possible to pass a StringIO object to to_csv(), but using a string is easier. Both variants are sketched below.
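
    A minimal sketch of both variants, assuming boto3 for the upload (bucket and key are placeholders):

    import io

    import boto3
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

    # Variant 1: to_csv(None) returns the CSV as a string
    csv_body = df.to_csv(None, index=False)

    # Variant 2: write into a StringIO buffer and read it back
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    csv_body = buf.getvalue()

    # Upload the string to S3 in one go
    boto3.client('s3').put_object(Bucket='my-bucket', Key='out.csv', Body=csv_body)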

  • 2020-11-28 02:31

    I like s3fs, which lets you use S3 (almost) like a local filesystem.

    You can do this:

    import s3fs
    
    bytes_to_write = df.to_csv(None).encode()
    fs = s3fs.S3FileSystem(key=key, secret=secret)
    with fs.open('s3://bucket/path/to/file.csv', 'wb') as f:
        f.write(bytes_to_write)
    

    s3fs supports only the rb and wb modes for opening files, which is why I do the bytes_to_write conversion above.

  • 2020-11-28 02:40

    You can also use the AWS Data Wrangler:

    import awswrangler as wr
        
    wr.s3.to_csv(
        df=df,
        path="s3://...",
    )
    

    Note that it handles multipart upload for you, which makes uploading large files faster.
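
    A rough round trip, assuming the package is installed (pip install awswrangler) and using a placeholder path:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2]})

    # Writes the CSV to S3; multipart upload is handled internally
    wr.s3.to_csv(df=df, path='s3://my-bucket/out.csv', index=False)

    # Read it back to verify
    round_trip = wr.s3.read_csv(path='s3://my-bucket/out.csv')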
