Difference between DataFrame, Dataset, and RDD in Spark

慢半拍i 2020-11-22 15:53

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row])?

15 Answers
  •  清酒与你
    2020-11-22 16:22

    These are all great answers, and each API comes with trade-offs. Dataset is built to be a general-purpose, higher-level API that solves a lot of problems, but RDDs often still work best when you understand your data well. If your processing algorithm is optimized to do many things in a single pass over a large dataset, an RDD seems to be the best option.

    Aggregation with the Dataset API still tends to be memory-intensive, although that should improve over time.
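
    To make the trade-off concrete, here is a minimal sketch that computes the same per-key statistics twice: once as a hand-written single pass over an RDD with aggregateByKey, and once declaratively with the Dataset API. The Reading case class, the SinglePassSketch object, and the sample values are hypothetical and exist only for illustration; the sketch assumes Spark 2.x or later running in local mode, and it says nothing about which version is faster on real data.

    ```scala
    import org.apache.spark.sql.{Dataset, Row, SparkSession}
    import org.apache.spark.sql.functions.{avg, count, max, min}

    // Hypothetical record type, used only for this illustration.
    case class Reading(sensor: String, value: Double)

    object SinglePassSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("single-pass-sketch")
          .master("local[*]") // assumption: local run, not a cluster
          .getOrCreate()
        import spark.implicits._

        // Made-up sample data.
        val readings = Seq(Reading("a", 1.0), Reading("a", 3.0), Reading("b", 10.0))

        // RDD: one pass that tracks (count, sum, min, max) per key by hand.
        val zero = (0L, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)
        val rddStats = spark.sparkContext.parallelize(readings)
          .map(r => (r.sensor, r.value))
          .aggregateByKey(zero)(
            // fold one value into a partition-local accumulator
            (acc, v) => (acc._1 + 1, acc._2 + v, math.min(acc._3, v), math.max(acc._4, v)),
            // merge accumulators coming from different partitions
            (x, y) => (x._1 + y._1, x._2 + y._2, math.min(x._3, y._3), math.max(x._4, y._4))
          )
        rddStats.collect().foreach(println)

        // Dataset: the same statistics, with the execution plan left to Spark.
        val ds: Dataset[Reading] = readings.toDS()
        val dsStats = ds.groupBy($"sensor")
          .agg(count($"value"), avg($"value"), min($"value"), max($"value"))
        dsStats.show()

        // DataFrame is a type alias for Dataset[Row], so this assignment compiles.
        val df: Dataset[Row] = ds.toDF()
        df.printSchema()

        spark.stop()
      }
    }
    ```

    With the RDD version you decide exactly what is carried through the pass, which is where knowing your data pays off. With the Dataset version you give up that control in exchange for a shorter, typed query that Spark can optimize for you.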
