How do you perform blocking I/O in an Apache Spark job?

名媛妹妹 2021-02-02 02:15

What if, while traversing an RDD, I need to compute values in the dataset by calling an external (blocking) service? How do you think that could be achieved?

val values: Fu

2 Answers
  •  天涯浪人
    2021-02-02 02:58

    Here is the answer to my own question:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.rdd.RDD
    
    val buckets = sc.textFile(logFile, 100)
    
    // Start one future per record; the body should return the result
    // of the external (blocking) call as an Object.
    val tasks: RDD[Future[Object]] = buckets.map { item =>
      Future {
        // call native code
      }
    }
    
    // Combine each partition's futures and block once per partition,
    // rather than once per record.
    val values: RDD[Object] = tasks.mapPartitions { f: Iterator[Future[Object]] =>
      val searchFuture: Future[Iterator[Object]] = Future.sequence(f)
      Await.result(searchFuture, JOB_TIMEOUT)
    }
    

    The idea is that the RDD is split into partitions, each of which is sent to a specific worker as the smallest unit of work. Each partition contains the data that is processed by calling the native code with it.
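
    For illustration, here is a self-contained sketch of the same pattern with a concrete blocking call; lookupScore is a hypothetical stand-in for the external service, and the simulated latency and 10-minute timeout are assumptions, not part of the original answer:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.rdd.RDD
    
    // Hypothetical stand-in for the external blocking service.
    def lookupScore(line: String): Double = {
      Thread.sleep(10)       // simulate the blocking call's latency
      line.length.toDouble   // pretend result
    }
    
    val lines: RDD[String] = sc.textFile(logFile, 100)
    
    val scores: RDD[Double] = lines.mapPartitions { iter =>
      // Force all futures to start, then block once for the whole partition.
      val futures = iter.map(line => Future(lookupScore(line))).toList
      Await.result(Future.sequence(futures), 10.minutes).iterator
    }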

    The 'values' RDD contains the data returned from the native code, and that work is performed across the cluster.
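
    Spark is lazy, so nothing executes until an action is called. A minimal usage sketch, assuming the definitions above and that JOB_TIMEOUT is a Duration defined elsewhere:

    // collect() triggers the whole pipeline across the cluster and
    // brings the results back to the driver.
    val results: Array[Object] = values.collect()
    println(s"processed ${results.length} records")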
