Should we parallelize a DataFrame like we parallelize a Seq before training?

无人及你 2021-02-04 10:11

Consider the code given here:

https://spark.apache.org/docs/1.2.0/ml-guide.html

import org.apache.spark.ml.classification.LogisticRegression
val training         


        
2 answers
  •  南笙 (OP)
    2021-02-04 10:31

    You may want to read up on the difference between an RDD and a DataFrame, and on how to convert between the two: Difference between DataFrame and RDD in Spark

    To answer your question directly: a DataFrame is already a distributed, partitioned dataset, so there is nothing left to parallelize. You can pass it directly to the fit() method of any Spark estimator; Spark handles the parallel execution in the background.
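
    As a rough illustration of fitting on a DataFrame directly, here is a minimal sketch. It assumes a recent Spark version (2.x or later) with SparkSession and the org.apache.spark.ml.linalg package, not the 1.2.0 API from the linked guide. The object name, the local[*] master, and the toy rows are made up for the example.

    ```scala
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    // Hypothetical example object; the name is not from the question.
    object FitOnDataFrame {
      def main(args: Array[String]): Unit = {
        // Assumption: a local session, only for trying this out.
        val spark = SparkSession.builder()
          .appName("FitOnDataFrame")
          .master("local[*]")
          .getOrCreate()

        // Build a DataFrame from a local Seq. There is no explicit
        // sc.parallelize call: createDataFrame distributes the rows.
        val training = spark.createDataFrame(Seq(
          (1.0, Vectors.dense(0.0, 1.1, 0.1)),
          (0.0, Vectors.dense(2.0, 1.0, -1.0)),
          (0.0, Vectors.dense(2.0, 1.3, 1.0)),
          (1.0, Vectors.dense(0.0, 1.2, -0.5))
        )).toDF("label", "features")

        // The DataFrame is already split into partitions.
        println(s"partitions: ${training.rdd.getNumPartitions}")

        // Pass the DataFrame straight to the estimator's fit().
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
        val model = lr.fit(training)
        println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")

        spark.stop()
      }
    }
    ```

    If you do start from an RDD, converting it to a DataFrame once is enough; after that it can be passed to fit() the same way.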
