I am trying to process some XML data received on a JMS queue (QPID) using Spark Streaming. After getting the XML as a DStream, I convert it to DataFrames so I can join them with some
So does that mean all the processing logic will run only on the driver and not get distributed to the workers/executors?
No, the function itself runs on the driver, but don't forget that it operates on an `RDD`. The inner functions that you'll use on the `RDD`, such as `foreachPartition`, `map`, `filter`, etc., will still run on the worker nodes. This won't cause all the data to be sent back over the network to the driver, unless you call methods like `collect`, which do.
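To make the driver/executor split concrete, here is a minimal sketch. It assumes the function in question is the one passed to `DStream.foreachRDD`, and it uses `queueStream` with made-up XML strings as a stand-in for the JMS/QPID source. The object name, sample data, and `local[2]` master are illustrative assumptions, not taken from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

// Hypothetical example object; not part of the original question.
object ForeachRddSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("foreachRDD-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Stand-in for the real JMS/QPID receiver: one pre-built batch of sample records.
    val batches = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("<a>1</a>", "", "<a>2</a>"))
    )
    val xmlStream = ssc.queueStream(batches)

    xmlStream.foreachRDD { rdd =>
      // The body of this function is invoked on the driver once per batch.
      // It only describes the work; the closures below are shipped to executors.
      val nonEmpty = rdd.filter(_.nonEmpty)        // closure runs on executors

      nonEmpty.foreachPartition { records =>       // closure runs on executors
        records.foreach(r => println(s"processing $r"))
      }

      // By contrast, collect() would pull every record back to the driver:
      // val everything: Array[String] = nonEmpty.collect()
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)            // let a few batches run, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

On a real cluster the `println` output would appear in the executor logs rather than on the driver console, which is one quick way to check where a given closure actually ran.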