How can I parallelize multiple Datasets in Spark?

Asked by 花落未央 on 2020-12-21 14:05

I have a Spark 2.1 job where I maintain multiple Dataset objects/RDDs that represent different queries over our underlying Hive/HDFS datastore. I've noticed that if I si

1 Answer
  • 2020-12-21 14:19

    Yes, you can use multithreading in the driver code, but normally this does not improve performance unless your queries operate on very skewed data and/or cannot be parallelized well enough to fully utilize the cluster's resources on their own.

    You can do something like this:

    val datasets: Seq[Dataset[_]] = ???
    
    datasets
      .par // convert the Seq to a parallel collection
      .foreach(ds => ds.write.saveAsTable(...))
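
    If you want explicit control over how many writes run at once (parallel collections size their thread pool from the number of CPU cores by default), a common alternative is `scala.concurrent.Future` with a fixed thread pool. A minimal sketch, in which `writeTable` and the table names are hypothetical stand-ins for your actual `ds.write.saveAsTable(...)` calls:

    ```scala
    import java.util.concurrent.Executors
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    // Hypothetical stand-in for ds.write.saveAsTable(name); returns the name for illustration.
    def writeTable(name: String): String = {
      println(s"writing $name")
      name
    }

    val tableNames = Seq("t1", "t2", "t3")

    // Cap concurrency at 4 driver threads instead of relying on the default pool size.
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Kick off all writes concurrently, then block until every one has finished.
    val jobs = tableNames.map(name => Future(writeTable(name)))
    val results = Await.result(Future.sequence(jobs), Duration.Inf)

    pool.shutdown() // release the non-daemon worker threads
    ```

    Each `Future` body runs on the driver, so each Spark action is submitted from its own thread and the Spark scheduler can run the resulting jobs concurrently.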
    