I'm just wondering what the difference is between an RDD
and a DataFrame
(in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row])?
All great answers, and each API has its trade-offs. The Dataset API is built to be a super API that solves a lot of problems, but many times an RDD still works best if you understand your data. If your processing algorithm is optimized to do many things in a single pass over large data, then an RDD seems to be the best option.
Aggregation using the Dataset API still consumes memory; this should get better over time.
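
To make the single-pass point concrete, here is a minimal sketch that computes count, sum, min and max in one pass with `RDD.aggregate`, next to the same aggregation through the Dataset API. The `Stats` accumulator, the `RddVsDatasetSketch` object and the sample values are hypothetical names I made up for illustration, and I'm assuming a local `SparkSession`. It shows only the shape of the two approaches and says nothing about which one uses less memory on your data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{functions => F}

// Hypothetical accumulator: holds every statistic we want from one pass.
final case class Stats(n: Long, total: Double, lo: Double, hi: Double) {
  // Fold one element into the running statistics (used within a partition).
  def add(x: Double): Stats =
    Stats(n + 1, total + x, math.min(lo, x), math.max(hi, x))

  // Combine the partial results of two partitions.
  def merge(o: Stats): Stats =
    Stats(n + o.n, total + o.total, math.min(lo, o.lo), math.max(hi, o.hi))
}

object Stats {
  // Neutral element: merging with it leaves the other side unchanged.
  val empty: Stats =
    Stats(0L, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)
}

object RddVsDatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, just for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-vs-dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    val values = Seq(3.0, 1.0, 4.0, 1.0, 5.0)

    // RDD: you write the per-element and per-partition logic yourself,
    // so all four statistics come out of a single traversal.
    val rddStats = spark.sparkContext
      .parallelize(values)
      .aggregate(Stats.empty)(_ add _, _ merge _)

    // Dataset: declare the aggregates and let the optimizer plan them.
    // A Dataset[Double] exposes its data as a column named "value".
    val dsStats = values.toDS()
      .agg(F.count($"value"), F.sum($"value"), F.min($"value"), F.max($"value"))
      .head()

    println(rddStats)
    println(dsStats)

    spark.stop()
  }
}
```

With the RDD version you control exactly what is kept per partition (here a single small `Stats` object), which is the kind of control that helps when you know your data well. With the Dataset version you give up that control in exchange for the optimizer.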