Training ml models on spark per partitions. Such that there will be a trained model per partition of dataframe

问题

How to do parallel model training per partition in spark using scala? The solution given here is in Pyspark. I'm looking for solution in scala. How can you efficiently build one ML model per partition in Spark with foreachPartition?

回答1:

Get the distinct partitions using partition col
Create a threadpool of say 100 threads
create future object for each threads and run

sample code may be as follows-

   // Get an ExecutorService 
    val threadPoolExecutorService = getExecutionContext("name", 100)
// check https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/HasParallelism.scala#L50

   val uniquePartitionValues: List[String] = ...//getDistingPartitionsUsingPartitionCol
    // Asynchronous invocation to training. The result will be collected from the futures.
    val uniquePartitionValuesFutures = uniquePartitionValues.map(partitionValue => {
      Future[Double] {
        try {
            // get dataframe where partitionCol=partitionValue
            val partitionDF = mainDF.where(s"partitionCol=$partitionValue")
          // do preprocessing and training using any algo with an input partitionDF and return accuracy
        } catch {
          ....
      }(threadPoolExecutorService)
    })

    // Wait for metrics to be calculated
    val foldMetrics = uniquePartitionValuesFutures.map(Await.result(_, Duration.Inf))
    println(s"output::${foldMetrics.mkString("  ###  ")}")

来源：https://stackoverflow.com/questions/61187595/training-ml-models-on-spark-per-partitions-such-that-there-will-be-a-trained-mo

标签

apache-spark

apache-spark-ml

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!