Cartesian of DStream

Question

I use Spark's cartesian function to generate a list of N pairs of values.

I then map over these pairs to compute a distance metric between each pair of users:

val cartesianUsers: org.apache.spark.rdd.RDD[(distance.classes.User, distance.classes.User)] = users.cartesian(users)
cartesianUsers.map(m => manDistance(m._1, m._2))

This works as expected.

Using the Spark Streaming library, I create a DStream and then process each of its RDDs:

val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream....
customReceiverStream.foreachRDD(m => {
  println("size is " + m.count())
})

I could use the cartesian function within customReceiverStream.foreachRDD, but according to the documentation at http://spark.apache.org/docs/1.2.0/streaming-programming-guide.htm this is not its intended use:

foreachRDD(func) The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to a external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

How can I compute the cartesian product of a DStream? Perhaps I'm misunderstanding how DStreams are meant to be used?

Answer 1:

I wasn't aware of the transform method:

cartesianUsers.transform(car => car.cartesian(car))

A nice talk that also mentions the transform function, at approximately 17:00: https://www.youtube.com/watch?v=g171ndOHgJ0
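For anyone who wants to try this end to end, here is a minimal sketch of transform combined with cartesian. Everything except transform, cartesian, map and print is an assumption made for illustration: the User case class and manDistance are hypothetical stand-ins for the question's distance.classes.User and its distance function (neither is shown in the question), queueStream replaces the custom receiver so the example runs locally, and awaitTerminationOrTimeout assumes Spark 1.3 or later.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-in for distance.classes.User from the question.
case class User(id: String, x: Double, y: Double)

object CartesianOfDStream {
  // Assumed Manhattan distance; the question's manDistance is not shown.
  def manDistance(a: User, b: User): Double =
    math.abs(a.x - b.x) + math.abs(a.y - b.y)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CartesianOfDStream")
    val ssc = new StreamingContext(conf, Seconds(1))

    // queueStream stands in for the custom receiver: each queued RDD becomes one batch.
    val batches = mutable.Queue[RDD[User]](
      ssc.sparkContext.parallelize(Seq(User("a", 0.0, 0.0), User("b", 1.0, 2.0)))
    )
    val users: DStream[User] = ssc.queueStream(batches)

    // transform applies an RDD-to-RDD function to every batch,
    // so the result is still a DStream rather than a driver-side side effect.
    val pairs: DStream[(User, User)] = users.transform(rdd => rdd.cartesian(rdd))

    // Ordinary DStream operations can follow, just as with the plain RDD version.
    val distances: DStream[(String, String, Double)] =
      pairs.map { case (u, v) => (u.id, v.id, manDistance(u, v)) }

    distances.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Note that transform works on one batch at a time, so the cartesian product only pairs up elements that arrive within the same batch interval. Pairing elements across batches would need something else, such as a windowed stream.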

Source: https://stackoverflow.com/questions/29034825/cartesian-of-dstream

Tags

apache-spark

dstream