Question
I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt this without using a Spark framework?
I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in.
Any tips or suggestions would be greatly appreciated!
Answer 1:
Spark is certainly a viable choice for this task.
We're planning to add streaming read logic to pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory-use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
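For illustration, here is a minimal sketch of that approach, reading one row group at a time and down-sampling each chunk before concatenating; the file name and the keep-every-100th-row sampling rate are assumptions for the example, not part of the original answer:

import pandas as pd
from pyarrow.parquet import ParquetFile

pf = ParquetFile("data.parquet")  # hypothetical large input file

frames = []
for i in range(pf.num_row_groups):
    # Read one row group into an Arrow table, then convert it to pandas.
    chunk = pf.read_row_group(i).to_pandas()
    # Down-sample: keep every 100th row (the rate is only an example).
    frames.append(chunk.iloc[::100])

# Combine the down-sampled chunks into a single dataframe.
df = pd.concat(frames, ignore_index=True)

Because only one row group is materialized at a time, peak memory use stays close to the size of a single row group plus the accumulated down-sampled rows.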
Answer 2:
This is not an answer; I'm posting here because this is the only relevant post I can find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There are no other error messages, and I'm not sure how to fix this:
from pyarrow.parquet import ParquetFile

path = "sample.parquet"
f = ParquetFile(source=path)
print(f.num_row_groups)  # prints the number of row groups

# Reading the entire file works:
df = f.read()

# Trying to read a single row group crashes:
row_df = f.read_row_group(0)
# I get:
# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Python version: 3.6.3
pyarrow version: 0.11.1
Source: https://stackoverflow.com/questions/54008975/streaming-parquet-file-python-and-only-downsampling