Is foreachRDD executed on the Driver?

礼貌的吻别 2021-02-13 23:06

I am trying to process some XML data received on a JMS queue (QPID) using Spark Streaming. After getting the XML as a DStream, I convert it to DataFrames so I can join them with some

2 Answers
  •  再見小時候
    2021-02-14 00:08

    To make this clear, if you run the following, you will see "monkey" on the driver's stdout:

    myDStream.foreachRDD { rdd =>
      println("monkey")
    }
    

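    If you run the following, you will see "monkey" on the driver's stdout. Note that filter is a lazy transformation, so it only runs once an action (here, count()) is called on its result. That work is then done on whatever executors the rdd is distributed across:

    myDStream.foreachRDD { rdd =>
      println("monkey")
      // filter is lazy; count() is the action that makes the executors run it
      rdd.filter(element => element == "Save me!").count()
    }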
    

    Let's simplify by assuming that myDStream only ever receives one RDD, and that this RDD is spread across a set of partitions we'll call PartitionSetA, which live on MachineSetB, where ExecutorSetC is running. If you run the following, you will see "monkey" on the driver's stdout and "turtle" on the stdouts of the executors in ExecutorSetC. "turtle" appears once per partition, and several partitions can sit on the machine where one executor is running. The filter and the addition are both done across ExecutorSetC, because foreachPartition is an action and it is called on the filtered RDD:

    myDStream.foreachRDD { rdd =>
      println("monkey")
      rdd.filter(element => element == "Save me!").foreachPartition { partition =>
        println("turtle")
        val x = 1 + 1
      }
    }
    

    One more thing to note: in the following code, y is captured by the closure, so it is serialized and sent across the network from the driver to ExecutorSetC along with the tasks for each rdd:

    val y = 2
    myDStream.foreachRDD { rdd =>
      println("monkey")
      rdd.filter(element => element == "Save me!").foreachPartition { partition =>
        println("turtle")
        val x = 1 + 1
        val z = x + y // y travels inside the serialized closure
      }
    }
    

    To avoid this overhead, you can use broadcast variables, which ship the value from the driver to each executor once and cache it there. For example:

    val y = 2
    // sc is assumed to be the application's SparkContext
    val broadcastY = sc.broadcast(y)
    myDStream.foreachRDD { rdd =>
      println("monkey")
      rdd.filter(element => element == "Save me!").foreachPartition { partition =>
        println("turtle")
        val x = 1 + 1
        val z = x + broadcastY.value // read the executor-local copy
      }
    }
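
    Putting the pieces together, here is a minimal local-mode sketch you could run to watch this happen. It is only an illustration under assumptions that are not in the question: the object name ForeachRddDemo, the queueStream test input, the sample strings, and the "matches" accumulator are all made up for the demo. In local mode the driver and the executors share one JVM, so "monkey" and "turtle" land in the same stdout; on a cluster they would be split as described above.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical demo object; requires spark-core and spark-streaming on the classpath.
object ForeachRddDemo {
  def run(): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreachRDD-demo")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Broadcast the value once instead of capturing it in every task closure.
    val target = sc.broadcast("Save me!")
    // Accumulator so the driver can read back how many elements the executors matched.
    val matches = sc.longAccumulator("matches")

    // A single RDD with 4 partitions, fed to the stream through a queue (test input only).
    val queue = mutable.Queue[RDD[String]](
      sc.parallelize(Seq("Save me!", "other", "Save me!"), 4))
    val myDStream = ssc.queueStream(queue)

    myDStream.foreachRDD { rdd =>
      println("monkey") // this body runs on the driver, once per batch
      rdd.filter(element => element == target.value).foreachPartition { partition =>
        println("turtle") // this body runs on an executor, once per partition
        partition.foreach(_ => matches.add(1))
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    matches.value.longValue()
  }
}
```

    Calling ForeachRddDemo.run() returns the number of matched elements as seen by the driver. Once the queue is drained, later batches are empty, so "monkey" keeps printing every batch while "turtle" does not.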
    

    To broadcast more complex things, such as objects that aren't easily serializable once instantiated, see the following blog post: https://allegro.tech/2015/08/spark-kafka-integration.html
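
    One common way to handle such objects is to ship a small serializable wrapper that holds only a factory function and builds the real object lazily on the receiving side. The sketch below shows the idea in plain Scala, with no Spark dependency. Connection, LazyConnection and WrapperDemo are hypothetical names invented for this example, and the Java serialization round trip only approximates what happens when a value is shipped to an executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a resource that is not Serializable (a client, a connection, ...).
class Connection(val name: String) {
  def send(msg: String): String = s"$name <- $msg"
}

// Serializable wrapper: only the factory function is serialized, never the Connection itself.
class LazyConnection(create: () => Connection) extends Serializable {
  // @transient lazy val is skipped during serialization and rebuilt on first use after deserialization.
  @transient lazy val connection: Connection = create()
}

object LazyConnection {
  def apply(name: String): LazyConnection =
    new LazyConnection(() => new Connection(name))
}

object WrapperDemo {
  // Serialize and deserialize a value, similar in spirit to shipping it to another JVM.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[T]
  }
}
```

    With Spark you would broadcast the wrapper, for example sc.broadcast(LazyConnection("sink")), and call .value.connection inside foreachPartition, so that each executor JVM creates its own Connection the first time it is needed.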
