Pandas cannot read parquet files created in PySpark

自闭症患者 2021-01-12 16:43

I am writing a parquet file from a Spark DataFrame the following way:

df.write.parquet(\"path/myfile.parquet\", mode = \"overwrite\", compression=\"gzip\")


        
3 Answers
  •  执笔经年
    2021-01-12 17:05

    The problem is that Spark partitions the output because of its distributed nature: each executor writes its own part file inside a directory that receives the name you passed. Pandas does not support this layout; it expects a single file, not a directory.
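
    For illustration, listing the output directory shows the layout Spark produces (a small sketch assuming the path from the question; the actual part file names will differ per run):

        import glob

        # Spark wrote a directory, not a single file: it contains one
        # part-*.gz.parquet file per task plus a _SUCCESS marker.
        for name in sorted(glob.glob("path/myfile.parquet/*")):
            print(name)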

    You can circumvent this issue in different ways:

    • Reading the file with an alternative utility, such as pyarrow.parquet.ParquetDataset, and then converting it to Pandas (I did not test this code):

        import pyarrow.parquet

        # ParquetDataset accepts the directory and reads every part file inside it
        arrow_dataset = pyarrow.parquet.ParquetDataset('path/myfile.parquet')
        arrow_table = arrow_dataset.read()
        pandas_df = arrow_table.to_pandas()
      
    • Another way is to read the fragments individually and then concatenate them, as this answer suggests: Read multiple parquet files in a folder and write to single csv file using python (see the sketch below).
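
    A minimal sketch of that approach, assuming the path from the question and Spark's default part-* file naming (each part file is a complete parquet file that pandas can read on its own):

        import glob
        import pandas as pd

        # Read every part file individually and stitch them back together.
        part_files = sorted(glob.glob("path/myfile.parquet/part-*.parquet"))
        pandas_df = pd.concat(
            (pd.read_parquet(f) for f in part_files),
            ignore_index=True,
        )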
