How does a Spark DataFrame handle a Pandas DataFrame that is larger than memory?

臣服心动 2020-12-18 10:33

I am learning Spark now, and it seems to be the big-data solution for Pandas DataFrames, but I have a question that makes me unsure.

Currently I am storing Pandas DataFrames that are larger than memory in HDF5 files. From what I have read, the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing. Is that correct?

1 Answer
  • 2020-12-18 10:39

    the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.

    This is true only if you try to load all of your data on the driver and then parallelize it. In a typical scenario you store the data in a format that can be read in parallel, which means your data:

    • has to be accessible from every worker, for example on a distributed file system such as HDFS
    • has to use a file format that supports splitting (the simplest example is plain old CSV)

    In a situation like this, each worker reads only its own part of the dataset, so there is never a need to hold the data in the driver's memory. All the logic related to computing splits is handled transparently by the applicable Hadoop input format.
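
    As an illustration, here is a minimal PySpark sketch of such a parallel read; the HDFS path is hypothetical. Because CSV is splittable, Spark divides the file into byte ranges and assigns one range per task, so no single machine ever holds the whole dataset:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-read").getOrCreate()

    # Each task reads only its own byte range of the (hypothetical) file.
    df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

    print(df.rdd.getNumPartitions())  # number of parallel splits
    ```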

    Regarding HDF5 files, you have two options:

    • read the data in chunks on the driver, build a Spark DataFrame from each chunk, and union the results. This is inefficient but easy to implement (see the first sketch after this list)
    • distribute the HDF5 file or files and read the data directly on the workers. This is, generally speaking, harder to implement and requires a smart data distribution strategy (see the second sketch below)
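
    A minimal sketch of the first option, assuming the data was written by pandas in table format (the file name data.h5 and the key table are made up). Only one chunk is held on the driver at a time, and union is a lazy transformation:

    ```python
    from functools import reduce

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # chunksize requires the store to be in "table" format, i.e. written
    # with df.to_hdf(..., format="table"); file name and key are hypothetical.
    chunks = pd.read_hdf("data.h5", key="table", chunksize=100_000)

    # One small Spark DataFrame per chunk, unioned lazily.
    sdf = reduce(lambda a, b: a.union(b),
                 (spark.createDataFrame(chunk) for chunk in chunks))
    ```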
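
    And a rough sketch of the second option, assuming the dataset is pre-split into many HDF5 files that every worker can reach (the paths, dataset key, and column names are all hypothetical) and that h5py is installed on each executor:

    ```python
    import h5py  # must be installed on every worker
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # One file per partition; each path must be readable from the worker
    # it is assigned to (shared filesystem, NFS mount, etc.).
    paths = ["/shared/data/part-%03d.h5" % i for i in range(100)]

    def read_part(path):
        # Runs on a worker: open this partition's file and yield its rows.
        with h5py.File(path, "r") as f:
            for row in f["table"][:]:     # "table" is a hypothetical key
                yield tuple(row.tolist())

    sdf = (sc.parallelize(paths, len(paths))
             .flatMap(read_part)
             .toDF(["col1", "col2"]))     # hypothetical column names
    ```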