Pandas cannot read parquet files created in PySpark

自闭症患者 2021-01-12 16:43

I am writing a parquet file from a Spark DataFrame the following way:

df.write.parquet("path/myfile.parquet", mode="overwrite", compression="gzip")

3 answers
  • 2021-01-12 16:59

    If the parquet file has been created with Spark (so it is actually a directory), import it into pandas like this:

    from pyarrow.parquet import ParquetDataset
    
    # Point ParquetDataset at the directory; it discovers the part files itself.
    dataset = ParquetDataset("file.parquet")
    table = dataset.read()    # read all fragments into a single Arrow table
    df = table.to_pandas()
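
    As a side note: with a reasonably recent pandas and the pyarrow engine installed, pd.read_parquet can usually read such a directory directly, so the explicit ParquetDataset step is often unnecessary:

        import pandas as pd

        # pandas delegates directory-style datasets to pyarrow
        df = pd.read_parquet("file.parquet", engine="pyarrow")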
    
  • 2021-01-12 17:05

    The problem is that Spark partitions the data because of its distributed nature: each executor writes its own part file inside the directory that receives the given filename. Pandas does not support this layout out of the box; it expects a single file, not a directory.
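
    For illustration (the exact part-file names vary from run to run), the resulting "file" is really a directory along these lines:

        path/myfile.parquet/
            _SUCCESS
            part-00000-<uuid>.gz.parquet
            part-00001-<uuid>.gz.parquet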

    You can circumvent this issue in different ways:

    • Reading the file with an alternative utility, such as pyarrow.parquet.ParquetDataset, and then converting the result to pandas (I did not test this code):

        import pyarrow.parquet

        arrow_dataset = pyarrow.parquet.ParquetDataset('path/myfile.parquet')
        arrow_table = arrow_dataset.read()
        pandas_df = arrow_table.to_pandas()
      
    • Another way is to read the separate fragments individually and then concatenate them, as this answer suggests: Read multiple parquet files in a folder and write to single csv file using python (see the sketch after this list).
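
    A minimal sketch of that second approach (assuming the part files inside the folder end in ".parquet"):

        import glob
        import os

        import pandas as pd

        # Collect the individual part files Spark wrote into the folder ...
        fragments = glob.glob(os.path.join("path/myfile.parquet", "*.parquet"))

        # ... then read each fragment and stitch them into one DataFrame.
        pandas_df = pd.concat(
            (pd.read_parquet(f) for f in fragments),
            ignore_index=True,
        )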

  • 2021-01-12 17:22

    Since this still seems to be an issue even with newer pandas versions, I wrote some functions to circumvent this as part of a larger pyspark helpers library:

    import datetime
    import os
    
    import pandas as pd
    
    
    def read_parquet_folder_as_pandas(path, verbosity=1):
      """Read all part files in a folder-style parquet file and concatenate them."""
      files = [f for f in os.listdir(path) if f.endswith(".parquet")]
    
      if verbosity > 0:
        print("{} parquet files found. Beginning reading...".format(len(files)), end="")
        start = datetime.datetime.now()
    
      # Read every fragment and stitch them together into one DataFrame.
      df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
      df = pd.concat(df_list, ignore_index=True)
    
      if verbosity > 0:
        end = datetime.datetime.now()
        print(" Finished. Took {}".format(end - start))
      return df
    
    
    def read_parquet_as_pandas(path, verbosity=1):
      """Workaround for pandas not being able to read folder-style parquet files."""
      if os.path.isdir(path):
        if verbosity > 1:
          print("Parquet file is actually a folder.")
        return read_parquet_folder_as_pandas(path, verbosity)
      else:
        return pd.read_parquet(path)
    

    This assumes that the relevant files in the parquet "file", which is actually a folder, end with ".parquet". This works for parquet files exported by Databricks and might work with others as well (untested; feedback in the comments is welcome).

    The function read_parquet_as_pandas() can be used if it is not known beforehand whether the path is a folder or a single file.
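
    For example, with the path from the question:

        # Works whether Spark produced a single file or a folder of part files.
        df = read_parquet_as_pandas("path/myfile.parquet")
        print(df.shape)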
