Streaming a Parquet file in Python and only down-sampling


Question


I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.

Am I wrong to attempt this without using a Spark framework?

I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be greatly appreciated!
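Roughly, the whole-file read I attempted looks like this (an illustrative sketch; "data.parquet" is a placeholder path):

import pyarrow.parquet as pq

# Materializes the entire 6 GB dataset at once, which exhausts memory.
df = pq.read_table("data.parquet").to_pandas()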


Answer 1:


Spark is certainly a viable choice for this task.

We're planning to add streaming read logic in pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory-use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
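A minimal sketch of that row-group approach, assuming pyarrow and pandas are available; the file path and the 10% sampling fraction are illustrative placeholders, not part of the original answer:

import pyarrow.parquet as pq
import pandas as pd

pf = pq.ParquetFile("data.parquet")  # placeholder path

samples = []
for i in range(pf.num_row_groups):
    # Only one row group is materialized in memory at a time.
    chunk = pf.read_row_group(i).to_pandas()
    # Down-sample each chunk before keeping it (a 10% random sample here).
    samples.append(chunk.sample(frac=0.1, random_state=0))

df = pd.concat(samples, ignore_index=True)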




Answer 2:


This is not an answer; I'm posting here because this is the only relevant post I can find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There are no other error messages, and I'm not sure how to fix this.

from pyarrow.parquet import ParquetFile

path = "sample.parquet"
f = ParquetFile(source=path)
print(f.num_row_groups)  # prints the number of row groups

# Reading the entire file works:
df = f.read()

# Trying to read a single row group crashes:
row_df = f.read_row_group(0)
# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Python version 3.6.3

pyarrow version 0.11.1



Source: https://stackoverflow.com/questions/54008975/streaming-parquet-file-python-and-only-downsampling
