How to create a Spark Dataset from an RDD

上瘾入骨i  2021-02-04 04:18

I have an RDD[LabeledPoint] intended for use within a machine learning pipeline. How do we convert that RDD to a Dataset? Note the ne…

1 Answer
  •  伪装坚强ぢ
     2021-02-04 04:41

    Here is an answer that traverses an extra step: the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset of the desired object type, in this case a LabeledPoint:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    import sqlContext.implicits._        // needed for .as[LabeledPoint]
    val pointsTrainDf = sqlContext.createDataFrame(training)
    val pointsTrainDs = pointsTrainDf.as[LabeledPoint]
    

    Update: Ever heard of a SparkSession? (Neither had I until now...)

    So apparently SparkSession is the Preferred Way (TM) in Spark 2.0.0 and onward. Here is the updated code for the new (Spark) world order:

    Spark 2.0.0+ approaches

    Notice that in both of the approaches below (the simpler of which is credited to @zero323) we achieve an important saving compared to the SQLContext approach: it is no longer necessary to first create a DataFrame.

    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._  // provides the Encoder for LabeledPoint
    val pointsTrainDs = sparkSession.createDataset(training)
    val model = new LogisticRegression().fit(pointsTrainDs)
    

    Second way for Spark 2.0.0+ (credit to @zero323):

    val spark: org.apache.spark.sql.SparkSession = ???  // an existing session
    import spark.implicits._
    
    val trainDs = training.toDS()
    

    Traditional Spark 1.x (1.6+) approach

    val sqlContext = new SQLContext(sc)  // Note this is *deprecated* in 2.0.0
    import sqlContext.implicits._
    val training = splits(0).cache()     // splits: e.g. from a randomSplit(...)
    val test = splits(1)
    val trainDs = training.toDS()
    

    See also: How to store custom objects in Dataset? by the esteemed @zero323.
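    To tie the pieces together, here is a minimal end-to-end sketch in the SparkSession style. The two-point training set, the `local[*]` master, and the app name are all made up for illustration; it assumes spark-sql and spark-mllib 2.x on the classpath:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val spark = SparkSession.builder()
      .master("local[*]")          // local run for illustration only
      .appName("rdd-to-dataset")
      .getOrCreate()
    import spark.implicits._       // brings the Encoder for LabeledPoint into scope

    // Hypothetical two-point training set as an RDD[LabeledPoint].
    val training = spark.sparkContext.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 2.0)),
      LabeledPoint(1.0, Vectors.dense(3.0, 4.0))
    ))

    val trainDs = training.toDS()  // equivalently: spark.createDataset(training)
    println(trainDs.count())      // prints 2
    spark.stop()
    ```

    Either `toDS()` or `createDataset` works here; both require the implicits import so Spark can find an Encoder for LabeledPoint.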
