Write pandas dataframe as compressed CSV directly to Amazon s3 bucket?

Asked by 醉酒成梦, 2021-01-02 23:09

I currently have a script that reads the existing version of a CSV saved to S3, combines that with the new rows in the pandas dataframe, and then writes the result directly back to S3. How can I write that dataframe back as a gzip-compressed CSV instead?
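Roughly, the current (uncompressed) flow looks like the sketch below; the bucket and key names are placeholders, and reading/writing s3:// paths assumes s3fs is installed:

    import pandas as pd

    # placeholder names; pandas delegates s3:// paths to s3fs
    new_rows = pd.DataFrame({'id': [1], 'value': ['a']})  # stand-in for the new data
    existing = pd.read_csv('s3://my-bucket/data.csv')
    combined = pd.concat([existing, new_rows], ignore_index=True)
    combined.to_csv('s3://my-bucket/data.csv', index=False)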

3 Answers
  • 2021-01-02 23:50

    There is a more elegant solution using smart-open (https://pypi.org/project/smart-open/)

    import pandas as pd
    from smart_open import open

    # smart_open compresses transparently based on the .gz suffix;
    # closing the file (via the with block) finalizes the upload to S3
    with open('s3://bucket/prefix/filename.csv.gz', 'w') as fout:
        df.to_csv(fout, index=False)
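    If you need non-default AWS credentials, here is a rough sketch assuming smart_open 5.x, where a pre-built boto3 client is passed via transport_params (check the docs for the version you have installed):

    import boto3
    import pandas as pd
    from smart_open import open

    df = pd.DataFrame({'a': [1, 2]})  # stand-in for your dataframe

    # 'my-profile' is a hypothetical AWS profile name
    session = boto3.Session(profile_name='my-profile')
    params = {'client': session.client('s3')}

    with open('s3://bucket/prefix/filename.csv.gz', 'w', transport_params=params) as fout:
        df.to_csv(fout, index=False)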
    
  • 2021-01-02 23:56

    If you want streaming writes (so that neither the compressed nor the uncompressed CSV has to be held in memory all at once), you can do this:

    import s3fs
    import io
    import gzip

    def write_df_to_s3(df, filename, path):
        s3 = s3fs.S3FileSystem(anon=False)
        with s3.open(path, 'wb') as f:
            # 'filename' only ends up in the gzip header; 'path' decides where the object lands
            with gzip.GzipFile(filename, mode='wb', compresslevel=9, fileobj=f) as gz:
                # the text wrapper is flushed and closed before the gzip stream,
                # so no buffered rows are lost
                with io.TextIOWrapper(gz, encoding='utf-8') as buf:
                    df.to_csv(buf, index=False)

    TextIOWrapper is needed until this issue is fixed: https://github.com/pandas-dev/pandas/issues/19827
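    A quick usage sketch (bucket and key names are placeholders, and df is the dataframe from the question):

    # 'data.csv' is recorded in the gzip header,
    # the second path is the S3 destination understood by s3fs
    write_df_to_s3(df, 'data.csv', 'my-bucket/exports/data.csv.gz')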

  • 2021-01-03 00:05

    Here's a solution in Python 3.5.2 using Pandas 0.20.1.

    The source DataFrame can be read from S3, a local CSV, or anywhere else.

    import boto3
    import gzip
    import pandas as pd
    from io import BytesIO, TextIOWrapper
    
    # read the source data (from S3 here, but it could come from anywhere)
    df = pd.read_csv('s3://ramey/test.csv')

    # gzip the CSV into an in-memory buffer
    gz_buffer = BytesIO()
    with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
        df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

    # upload the compressed bytes to S3
    s3_resource = boto3.resource('s3')
    s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
    s3_object.put(Body=gz_buffer.getvalue())
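    As an optional follow-up to the snippet above (ContentEncoding and ContentType are standard arguments to put; the read-back relies on pandas inferring gzip compression from the .gz suffix):

    # optionally tag the object so clients know it is gzip-compressed CSV
    s3_object.put(
        Body=gz_buffer.getvalue(),
        ContentEncoding='gzip',
        ContentType='text/csv',
    )

    # sanity check: read the compressed file straight back into pandas
    df_check = pd.read_csv('s3://ramey/new-file.csv.gz')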
    