Question
My data arrives as a stream, and I want to store it all in a single Parquet file. But PyArrow overwrites the Parquet file every time. What should I do?
I tried not closing the writer, but that does not seem possible: if I don't close it, I can't read the file.
Here is the code:
import pyarrow.parquet as pq
import pyarrow as pa

for name in ['LEE', 'LSY', 'asd', 'wer']:
    arrays = [pa.array([name]), pa.array([2])]
    fields = [pa.field('name', pa.string()), pa.field('age', pa.int64())]
    table = pa.Table.from_arrays(arrays, schema=pa.schema(fields))
    # opening a new ParquetWriter on the same path overwrites the file each time
    writer = pq.ParquetWriter('d:/test.parquet', table.schema)
    writer.write_table(table)
    writer.close()
But what I actually want is to close the writer each time and then reopen it to append one row to the data, like this:
for name in ['LEE', 'LSY', 'asd', 'wer']:
    arrays = [pa.array([name]), pa.array([2])]
    fields = [pa.field('name', pa.string()), pa.field('age', pa.int64())]
    table = pa.Table.from_arrays(arrays, schema=pa.schema(fields))
    writer = pq.ParquetWriter('d:/test.parquet', table.schema)
    writer.write_table(table)
    writer.close()
Answer 1:
Parquet files cannot be appended to once they are written. The typical solution in this case is to write a new Parquet file for each batch (the files can together form a single partitioned Parquet dataset), or, if it is not much data, to first gather the data in Python into a single table and write it out once.
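A minimal sketch of both suggestions, using the sample data from the question; the d:/test_dataset directory and the part-file names are illustrative assumptions, not part of the original answer:

import os
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field('name', pa.string()),
                    pa.field('age', pa.int64())])

# Option 1: write one small Parquet file per incoming batch into a directory
# (hypothetical path); the directory as a whole forms a dataset that can be
# read back as a single table.
os.makedirs('d:/test_dataset', exist_ok=True)
for i, name in enumerate(['LEE', 'LSY', 'asd', 'wer']):
    table = pa.Table.from_arrays([pa.array([name]), pa.array([2])], schema=schema)
    pq.write_table(table, f'd:/test_dataset/part-{i}.parquet')

combined = pq.read_table('d:/test_dataset')  # reads all part files as one table

# Option 2: if the data fits in memory, accumulate the batches and write once.
tables = []
for name in ['LEE', 'LSY', 'asd', 'wer']:
    tables.append(pa.Table.from_arrays([pa.array([name]), pa.array([2])], schema=schema))
pq.write_table(pa.concat_tables(tables), 'd:/test.parquet')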
See this email thread with some more discussion about it: https://lists.apache.org/thread.html/07b1e3f13b5dae7e34ee3752f3cd4d16a94deb3a5f43893b73475900@%3Cdev.arrow.apache.org%3E
Source: https://stackoverflow.com/questions/56747062/how-to-use-pyarrow-to-achieve-stream-writing-effect