Why is Apache Spark (Python) so slow locally compared to pandas?

Asked 2020-11-28 13:10 by 陌清茗

A Spark newbie here. I recently started playing around with Spark on my local machine, using two cores, via the command:

pyspark --master local[2]
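
For context, the kind of side-by-side comparison that typically prompts this question looks roughly like the sketch below (the file path and column name are placeholders, not details from the original post):

```python
import time

import pandas as pd
from pyspark.sql import SparkSession

# Hypothetical input; the path and column name are placeholders.
PATH = "data.csv"

# pandas: single process, purely in memory
t0 = time.time()
pdf = pd.read_csv(PATH)
pdf.groupby("some_column").size()
print(f"pandas: {time.time() - t0:.2f}s")

# PySpark in local mode with 2 cores
spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("pandas-vs-spark")
    .getOrCreate()
)

t0 = time.time()
sdf = spark.read.csv(PATH, header=True, inferSchema=True)
sdf.groupBy("some_column").count().collect()  # collect() forces execution
print(f"Spark:  {time.time() - t0:.2f}s")

spark.stop()
```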

1 Answer
  • 2020-11-28 13:54

    Because:

    • Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of these properties has a significant cost.
    • Because purely in-memory, in-core processing (pandas) is orders of magnitude faster than disk and network I/O (Spark), even when that I/O is local.
    • Because parallelism (and distributed processing) adds significant overhead, and even an optimal, embarrassingly parallel workload does not guarantee any performance improvement. A quick way to observe this fixed overhead is sketched after this list.
    • Because local mode is not designed for performance; it is meant for testing.
    • Last but not least, 2 cores working on 393 MB is not enough to see any performance improvement, and a single node provides no opportunity for distribution.
    • See also: "Spark: Inconsistent performance number in scaling number of cores", "Why is pyspark so much slower in finding the max of a column?", and "Why does my Spark run slower than pure Python? Performance comparison".
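
    The fixed per-job overhead described above is easy to observe directly: run the same trivial aggregation in both engines on a tiny dataset. A minimal sketch (absolute timings vary by machine, but Spark's time will be dominated by scheduling and Python <-> JVM serialization rather than by the computation itself):

    ```python
    import time

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    data = list(range(1000))

    # pandas: summing 1000 integers is effectively instantaneous
    t0 = time.time()
    pd.Series(data).sum()
    print(f"pandas sum: {(time.time() - t0) * 1000:.1f} ms")

    # Spark: the same sum pays job scheduling, task launch and
    # Python <-> JVM serialization costs on every action, no matter
    # how small the data is
    t0 = time.time()
    spark.sparkContext.parallelize(data).sum()
    print(f"Spark sum:  {(time.time() - t0) * 1000:.1f} ms")

    spark.stop()
    ```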

    You can go on like this for a long time...
