Question
I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt this without using a Spark framework?
I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in.
Any tips or suggestions would be greatly appreciated!
Answer 1:
Spark is certainly a viable choice for this task.
We're planning to add streaming read logic to pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory-use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
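For illustration, here is a minimal sketch of that approach, reading one row group at a time and down-sampling each chunk before concatenating; the file name and the keep-every-100th-row sampling rate are assumptions for the example, not part of the original answer:

import pandas as pd
from pyarrow.parquet import ParquetFile

pf = ParquetFile("data.parquet")  # hypothetical large input file

frames = []
for i in range(pf.num_row_groups):
    # Read one row group into an Arrow table, then convert it to pandas.
    chunk = pf.read_row_group(i).to_pandas()
    # Down-sample: keep every 100th row (the rate is only an example).
    frames.append(chunk.iloc[::100])

# Combine the down-sampled chunks into a single dataframe.
df = pd.concat(frames, ignore_index=True)

Because only one row group is materialized at a time, peak memory use stays close to the size of a single row group plus the accumulated down-sampled rows.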
Answer 2:
This is not an answer; I'm posting here because this is the only relevant post I can find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There are no other error messages, and I'm not sure how to fix this:
from pyarrow.parquet import ParquetFile

path = "sample.parquet"
f = ParquetFile(source=path)
print(f.num_row_groups)  # prints the number of row groups

# Reading the entire file works:
df = f.read()

# Trying to read a single row group crashes:
row_df = f.read_row_group(0)
# I get:
# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Python version: 3.6.3
pyarrow version: 0.11.1
Source: https://stackoverflow.com/questions/54008975/streaming-parquet-file-python-and-only-downsampling