How to save a huge pandas dataframe to hdfs?

死守一世寂寞 2020-12-05 08:55

I'm working with pandas and with Spark DataFrames. The DataFrames are always very big (> 20 GB), and the standard Spark functions are not sufficient for those sizes. Currently

4 answers
  • 2020-12-05 09:06

    From https://issues.apache.org/jira/browse/SPARK-6235

    Support for parallelizing R data.frame larger than 2GB

    is resolved.

    From https://pandas.pydata.org/pandas-docs/stable/r_interface.html

    Converting DataFrames into R objects

    you can convert a pandas dataframe to an R data.frame

    So perhaps the chain pandas -> R -> Spark -> HDFS would work?

  • 2020-12-05 09:06

    Another way is to convert your pandas DataFrame to a Spark DataFrame (using PySpark) and save it to HDFS. For example:

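        import pandas as pd
        from pyspark import SparkConf, SparkContext
        from pyspark.sql import SQLContext

        df = pd.read_csv("data/as/foo.csv")
        # Cast object-dtype columns to str so Spark can infer a string type
        df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(str)
        sc = SparkContext(conf=SparkConf())
        sqlCtx = SQLContext(sc)
        sdf = sqlCtx.createDataFrame(df)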
    
    

    Here astype changes the type of your columns from object to string. This avoids the exception Spark would otherwise raise because it cannot infer a type for pandas object columns. Make sure these columns really do contain strings.

    Now save the Spark DataFrame to HDFS:

        sdf.write.csv('mycsv.csv')
    
  • 2020-12-05 09:11

    A hack could be to split the big pandas DataFrame into N smaller ones (each less than 2 GB, i.e. horizontal partitioning), create N Spark DataFrames from them, and then merge (union) those into a final one to write to HDFS. I am assuming that your master machine is powerful and that you also have a cluster available on which Spark is running.
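
    The chunk-and-union idea above could be sketched roughly as follows. This is only an illustration under assumptions: `spark` is an existing SparkSession, `pdf` is the pandas DataFrame, and the helper names (`chunk_bounds`, `pandas_to_spark_in_chunks`) and the `chunk_rows` parameter are hypothetical, not part of the original answer. Row counts stand in for the 2 GB limit, so `chunk_rows` has to be chosen for your own data.

```python
from functools import reduce


def chunk_bounds(n_rows, chunk_rows):
    """Return (start, stop) row offsets that cover n_rows in order."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    return [(start, min(start + chunk_rows, n_rows))
            for start in range(0, n_rows, chunk_rows)]


def pandas_to_spark_in_chunks(spark, pdf, chunk_rows):
    """Convert pdf slice by slice and union the slices into one Spark DataFrame.

    Assumes pdf is non-empty and that every slice yields the same inferred schema
    (a column that is entirely null in one slice may break that assumption).
    """
    parts = [spark.createDataFrame(pdf.iloc[start:stop])
             for start, stop in chunk_bounds(len(pdf), chunk_rows)]
    # DataFrame.union matches columns by position, which is fine here because
    # all slices come from the same pandas DataFrame.
    return reduce(lambda left, right: left.union(right), parts)
```

    It might be called as `pandas_to_spark_in_chunks(spark, pdf, 1_000_000)`, followed by a normal write of the result. Note that the whole pandas DataFrame still has to fit in the driver's memory, so this only works around the per-object size limit.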

  • 2020-12-05 09:12

    Meaning having a pandas dataframe which I transform to spark with the help of pyarrow.

    pyarrow.Table.from_pandas is the function you're looking for:

    Table.from_pandas(type cls, df, bool timestamps_to_ms=False, Schema schema=None, bool preserve_index=True)
    
    Convert pandas.DataFrame to an Arrow Table
    
    import pyarrow as pa
    
    pdf = ...  # type: pandas.core.frame.DataFrame
    adf = pa.Table.from_pandas(pdf)  # type: pyarrow.lib.Table
    

    The result can be written directly to Parquet / HDFS without passing data via Spark:

    import pyarrow.parquet as pq
    
    fs  = pa.hdfs.connect()
    
    with fs.open(path, "wb") as fw:
        pq.write_table(adf, fw)
    

    See also

    • @WesMcKinney's answer on reading Parquet files from HDFS using PyArrow.
    • Reading and Writing the Apache Parquet Format in the pyarrow documentation.
    • Native Hadoop file system (HDFS) connectivity in Python

    Spark notes:

    Furthermore, since Spark 2.3 (current master), Arrow is supported directly in createDataFrame (SPARK-20791 - Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame). It uses SparkContext.defaultParallelism to compute the number of chunks, so you can easily control the size of individual batches.

    Finally, defaultParallelism can be used to control the number of partitions generated by the standard _convert_from_pandas, effectively reducing the size of the slices to something more manageable.

    Unfortunately, these are unlikely to resolve your current memory problems. Both depend on parallelize and therefore store all data in the memory of the driver node. Switching to Arrow or adjusting the configuration can only speed up the process or address block size limitations.

    In practice I don't see any reason to switch to Spark here, as long as you use a local pandas DataFrame as the input. The most severe bottleneck in this scenario is the driver's network I/O, and distributing the data won't address that.
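
    For reference, the Arrow path mentioned in the Spark notes might be switched on roughly like this. It is a sketch under assumptions: `spark` is an existing SparkSession, `pdf` is a pandas DataFrame, and both helper names are hypothetical. The configuration key is `spark.sql.execution.arrow.enabled` in Spark 2.3/2.4 and `spark.sql.execution.arrow.pyspark.enabled` from Spark 3.0 on. As said above, this speeds up the conversion but the data still passes through the driver.

```python
def arrow_conf_key(spark_version):
    """Pick the Arrow switch for a Spark version string such as "2.4.8"."""
    major = int(spark_version.split(".")[0])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


def pandas_to_spark_with_arrow(spark, pdf):
    """Enable Arrow-based conversion, then build the Spark DataFrame."""
    # spark.version is the version string reported by the running session.
    spark.conf.set(arrow_conf_key(spark.version), "true")
    return spark.createDataFrame(pdf)
```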
