Methods for writing Parquet files using Python?


Question


I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it.

Thus far the only method I have found is using Spark with the pyspark.sql.DataFrame Parquet support.

I have some scripts that need to write Parquet files that are not Spark jobs. Is there any approach to writing Parquet files in Python that doesn't involve pyspark.sql?


Answer 1:


Update (March 2017): There are currently 2 libraries capable of writing Parquet files:

  1. fastparquet
  2. pyarrow

Both still appear to be under heavy development and come with a number of disclaimers (no support for nested data, for example), so you will have to check whether they support everything you need.
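
For example, a minimal pyarrow sketch (the file name and DataFrame are just placeholders, and the snappy codec assumes the codec is available in your build):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(df)                                 # convert the pandas DataFrame to an Arrow table
pq.write_table(table, 'outfile.parquet', compression='snappy')   # write it as Snappy-compressed Parquet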

OLD ANSWER:

As of February 2016, there seems to be no Python-only library capable of writing Parquet files.

If you only need to read Parquet files, there is python-parquet.

As a workaround you will have to rely on another process such as pyspark.sql (which uses Py4J and runs on the JVM, and thus cannot be used directly from an ordinary CPython program).




Answer 2:


fastparquet does have write support; here is a snippet that writes a DataFrame to a file:

from fastparquet import write

# df is a pandas DataFrame; 'outfile.parq' is the target Parquet file
write('outfile.parq', df)
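
If you want compression, fastparquet's write also takes a compression argument; a sketch, assuming the python-snappy package is installed:

from fastparquet import write

# 'SNAPPY' requires python-snappy; 'GZIP' works with the standard library
write('outfile.snappy.parq', df, compression='SNAPPY')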



Answer 3:


Using fastparquet, you can write a pandas DataFrame to Parquet with either snappy or gzip compression, as follows.

Make sure you have the following installed:

$ conda install python-snappy
$ conda install fastparquet

Do the imports:

import pandas as pd 
import snappy
import fastparquet

Assume you have the following pandas DataFrame:

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

Write the DataFrame to Parquet with snappy compression:

df.to_parquet('df.snap.parquet', compression='snappy')

Write the DataFrame to Parquet with gzip compression:

df.to_parquet('df.gzip.parquet', compression='gzip')

To check, read the Parquet file back into a pandas DataFrame:

pd.read_parquet('df.snap.parquet')

or

pd.read_parquet('df.gzip.parquet')

Output:

   col1  col2
0     1     3
1     2     4



Answer 4:


pyspark seems to be the best alternative right now for writing Parquet from Python. It may seem like using a sword in place of a needle, but that's how it is at the moment.

  • It supports most compression types, such as lzo and snappy; zstd support should land soon.
  • It has complete schema support (nested types, structs, etc.).

Simply run pip install pyspark and you are good to go.

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
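
As a rough sketch of what that looks like (the app name, column names, and output path are just placeholders):

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("write-parquet").getOrCreate()

# Build a small DataFrame and write it out as Snappy-compressed Parquet
df = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"])
df.write.parquet("outfile.parquet", compression="snappy")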



Source: https://stackoverflow.com/questions/32940416/methods-for-writing-parquet-files-using-python
