Question
I want to count the distinct values of some type of ID held in an RDD. In the non-streaming case it's fairly straightforward. Say IDs is an RDD of IDs read from a flat file:
print ("number of unique IDs %d" % (IDs.distinct().count()))
But I can't seem to do the same thing in the streaming case. Say streamIDs is a DStream of IDs read from the network:
print ("number of unique IDs from stream %d" % (streamIDs.distinct().count()))
This gives me the following error:
AttributeError: 'TransformedDStream' object has no attribute 'distinct'
What am I doing wrong? How do I print out the number of distinct IDs that showed up during this batch?
Answer 1:
With RDDs you have a single result, but with DStreams you have a series of results, one per micro batch. So you cannot print the number of unique IDs once; instead you register an action that prints it for each micro batch. Each micro batch is an ordinary RDD, so you can use distinct on it:
streamIDs.foreachRDD(lambda rdd: print("number of unique IDs from stream %d" % rdd.distinct().count()))
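For context, here is a minimal end-to-end sketch of this approach in PySpark, the language of the question. The socket source on localhost:9999 and the 10-second batch interval are illustrative assumptions, not part of the original question:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# local[2]: a socket receiver occupies one core, so local mode needs at least two
sc = SparkContext("local[2]", "DistinctIDs")
ssc = StreamingContext(sc, 10)  # assumed 10-second micro batches

# Assumed source: one ID per line over a TCP socket
streamIDs = ssc.socketTextStream("localhost", 9999)

# foreachRDD runs the function once per micro batch; each batch is a plain RDD
streamIDs.foreachRDD(lambda rdd: print("number of unique IDs from stream %d" % rdd.distinct().count()))

ssc.start()
ssc.awaitTermination()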
Remember you can use window to create a transformed DStream whose batches cover a longer time span:
streamIDs.window(60).foreachRDD(lambda rdd: print(rdd.distinct().count()))  # window length in seconds; must be a multiple of the batch interval
Answer 2:
Have you tried using:
yourDStream.transform(lambda rdd: rdd.distinct())
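Note that transform returns another DStream, so by itself this prints nothing; you still need an output operation. One way to wire it up, as a sketch (the count()/pprint() pairing is a suggestion, not part of the original answer):
distinctIDs = yourDStream.transform(lambda rdd: rdd.distinct())
# count() produces a one-element DStream per batch; pprint() prints it on the driver
distinctIDs.count().pprint()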
Source: https://stackoverflow.com/questions/32573962/spark-streaming-dstream-does-not-have-distinct