Methods for writing Parquet files using Python?

再見小時候 2021-02-02 09:30

I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction wi

6 answers
  • 2021-02-02 09:58

    Using fastparquet, you can write a pandas DataFrame to Parquet with either Snappy or gzip compression as follows.

    Make sure you have installed the following:

    $ conda install python-snappy
    $ conda install fastparquet
    

    Do the imports:

    import pandas as pd 
    import snappy
    import fastparquet
    

    Assume you have the following pandas DataFrame:

    df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
    

    Write df to Parquet with Snappy compression:

    df.to_parquet('df.snap.parquet', compression='snappy')
    

    Write df to Parquet with gzip compression:

    df.to_parquet('df.gzip.parquet', compression='gzip')
    

    To check, read the Parquet file back into a pandas DataFrame:

    pd.read_parquet('df.snap.parquet')
    

    or

    pd.read_parquet('df.gzip.parquet')
    

    Output:

       col1  col2
    0     1     3
    1     2     4
    
  • 2021-02-02 10:02

    I've written a comprehensive guide to Python and Parquet with an emphasis on taking advantage of Parquet's three primary optimizations: columnar storage, columnar compression and data partitioning. There is a fourth optimization that isn't covered yet, row groups, but they aren't commonly used. The ways of working with Parquet in Python are pandas, PyArrow, fastparquet, PySpark, Dask and AWS Data Wrangler.

    Check out the post here: Python and Parquet Performance In Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask

  • 2021-02-02 10:03

    A simple method to write a pandas DataFrame to Parquet.

    Assuming df is the pandas DataFrame, we need to import the following libraries:

    import pyarrow as pa
    import pyarrow.parquet as pq
    

    First, convert the DataFrame df into a pyarrow Table.

    # Convert DataFrame to Apache Arrow Table
    table = pa.Table.from_pandas(df)
    

    Second, write the table into a Parquet file, say file_name.parquet.

    # Parquet with the default (Snappy) compression
    pq.write_table(table, 'file_name.parquet')
    

    NOTE: Parquet files can be further compressed while writing. The following are the popular compression formats:

    • Snappy (default, requires no argument)
    • gzip
    • brotli

    Parquet with Snappy compression

    pq.write_table(table, 'file_name.parquet')
    

    Parquet with GZIP compression

    pq.write_table(table, 'file_name.parquet', compression='GZIP')
    

    Parquet with Brotli compression

    pq.write_table(table, 'file_name.parquet', compression='BROTLI')
    

    A comparison of the results achieved with the different Parquet compression formats:

    Reference: https://tech.jda.com/efficient-dataframe-storage-with-apache-parquet/

  • 2021-02-02 10:06

    Update (March 2017): There are currently 2 libraries capable of writing Parquet files:

    1. fastparquet
    2. pyarrow

    Both of them still seem to be under heavy development and come with a number of disclaimers (e.g. no support for nested data), so you will have to check whether they support everything you need.

    OLD ANSWER:

    As of February 2016 there seems to be NO Python-only library capable of writing Parquet files.

    If you only need to read Parquet files there is python-parquet.

    As a workaround you will have to rely on some other process, e.g. pyspark.sql (which uses Py4J and runs on the JVM, and thus cannot be used directly from your average CPython program).

  • 2021-02-02 10:07

    fastparquet does have write support. Here is a snippet to write data to a file:

    import pandas as pd
    from fastparquet import write
    df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})  # example data
    write('outfile.parq', df)
    
  • 2021-02-02 10:23

    PySpark seems to be the best alternative right now for writing Parquet with Python. It may seem like using a sword in place of a needle, but that's how it is at the moment.

    • It supports most compression types, such as LZO and Snappy. Zstd support should arrive soon.
    • It has complete schema support (nested, structs, etc.)

    Simply run pip install pyspark and you are good to go.

    https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
