Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?


Question


I have a multi-million record SQL table that I'm planning to write out to many parquet files in a folder, using the pyarrow library. The data content seems too large to store in a single parquet file.

However, I can't seem to find an API or parameter in the pyarrow library that lets me specify something like:

file_scheme="hive"

As is supported by the fastparquet python library.
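For reference, this is roughly what that call looks like with fastparquet (a minimal sketch; df is the dataframe built below):

from fastparquet import write

# file_scheme='hive' writes a directory of part files plus metadata,
# instead of a single .parquet file
write('./clients/', df, file_scheme='hive')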

Here's my sample code:

#!/usr/bin/python

import pyodbc
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

conn_str = ('UID=username;PWD=passwordHere;'
            'DRIVER=FreeTDS;SERVERNAME=myConfig;DATABASE=myDB')

#----> Query the SQL database into a Pandas dataframe
conn = pyodbc.connect( conn_str, autocommit=False)
sql = "SELECT * FROM ClientAccount (NOLOCK)"
df = pd.io.sql.read_sql(sql, conn)


#----> Convert the dataframe to a pyarrow table and write it out
table = pa.Table.from_pandas(df)
pq.write_table(table, './clients/')

This throws an error:

File "/usr/local/lib/python2.7/dist-packages/pyarrow/parquet.py", line 912, in write_table
    os.remove(where)
OSError: [Errno 21] Is a directory: './clients/'

If I replace that last line with the following, it works fine but writes only one big file:

pq.write_table(table, './clients.parquet')

Any ideas on how I can get multi-file output with pyarrow?


Answer 1:


Try pyarrow.parquet.write_to_dataset https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L938.
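A minimal sketch of that call, reusing the table from the question ('state' is a hypothetical column name used to illustrate hive-style partitioning):

import pyarrow.parquet as pq

# writes part files under ./clients/ instead of a single file
pq.write_to_dataset(table, root_path='./clients/')

# with partition_cols, output is split into hive-style subdirectories,
# e.g. ./clients/state=NY/<uuid>.parquet
pq.write_to_dataset(table, root_path='./clients/', partition_cols=['state'])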

I opened https://issues.apache.org/jira/browse/ARROW-1858 about adding some more documentation about this.

I recommend seeking support for Apache Arrow on the mailing list dev@arrow.apache.org. Thanks!



Source: https://stackoverflow.com/questions/47482434/can-pyarrow-write-multiple-parquet-files-to-a-folder-like-fastparquets-file-sch
