Question
My data arrives as a stream, and I want to store it all in a single Parquet file. But PyArrow overwrites the Parquet file every time. What should I do?
I tried not closing the writer, but that does not seem possible: if I don't close it, I can't read the file.
Here is the code:
import pyarrow.parquet as pq
import pyarrow as pa

for name in ['LEE', 'LSY', 'asd', 'wer']:
    arrays = [pa.array([name]), pa.array([2])]
    fields = [pa.field('name', pa.string()), pa.field('age', pa.int64())]
    table = pa.Table.from_arrays(arrays, schema=pa.schema(fields))
    # opening a new ParquetWriter on the same path overwrites the file each time
    writer = pq.ParquetWriter('d:/test.parquet', table.schema)
    writer.write_table(table)
    writer.close()
But what I actually want is to close the writer each time and then reopen it to append one row to the data, like this:
for name in ['LEE', 'LSY', 'asd', 'wer']:
    arrays = [pa.array([name]), pa.array([2])]
    fields = [pa.field('name', pa.string()), pa.field('age', pa.int64())]
    table = pa.Table.from_arrays(arrays, schema=pa.schema(fields))
    writer = pq.ParquetWriter('d:/test.parquet', table.schema)
    writer.write_table(table)
    writer.close()
Answer 1:
Parquet files cannot be appended to once they are written. The typical solution in this case is to write a new Parquet file for each batch (the files can together form a single partitioned Parquet dataset), or, if it is not much data, to first gather the data in Python into a single table and write it out once.
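A minimal sketch of both suggestions, using the sample data from the question; the d:/test_dataset directory and the part-file names are illustrative assumptions, not part of the original answer:

import os
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field('name', pa.string()),
                    pa.field('age', pa.int64())])

# Option 1: write one small Parquet file per incoming batch into a directory
# (hypothetical path); the directory as a whole forms a dataset that can be
# read back as a single table.
os.makedirs('d:/test_dataset', exist_ok=True)
for i, name in enumerate(['LEE', 'LSY', 'asd', 'wer']):
    table = pa.Table.from_arrays([pa.array([name]), pa.array([2])], schema=schema)
    pq.write_table(table, f'd:/test_dataset/part-{i}.parquet')

combined = pq.read_table('d:/test_dataset')  # reads all part files as one table

# Option 2: if the data fits in memory, accumulate the batches and write once.
tables = []
for name in ['LEE', 'LSY', 'asd', 'wer']:
    tables.append(pa.Table.from_arrays([pa.array([name]), pa.array([2])], schema=schema))
pq.write_table(pa.concat_tables(tables), 'd:/test.parquet')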
See this email thread with some more discussion about it: https://lists.apache.org/thread.html/07b1e3f13b5dae7e34ee3752f3cd4d16a94deb3a5f43893b73475900@%3Cdev.arrow.apache.org%3E
Source: https://stackoverflow.com/questions/56747062/how-to-use-pyarrow-to-achieve-stream-writing-effect