Question
I want to count the distinct values of some type of ID held in an RDD. In the non-streaming case it's fairly straightforward. Say IDs is an RDD of IDs read from a flat file:
print ("number of unique IDs %d" % (IDs.distinct().count()))
But I can't seem to do the same thing in the streaming case. Say streamIDs is a DStream of IDs read from the network:
print ("number of unique IDs from stream %d" % (streamIDs.distinct().count()))
This gives me the following error:
AttributeError: 'TransformedDStream' object has no attribute 'distinct'
What am I doing wrong? How do I print out the number of distinct IDs that showed up during this batch?
Answer 1:
With RDDs you have a single result, but with DStreams you have a series of results, one per micro batch. So you cannot print the number of unique IDs once; instead you register an action that prints it for each micro batch. Each micro batch is an ordinary RDD, so you can use distinct on it:
streamIDs.foreachRDD(lambda rdd: print("number of unique IDs from stream %d" % rdd.distinct().count()))
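For context, here is a minimal end-to-end sketch of this approach in PySpark, the language of the question. The socket source on localhost:9999 and the 10-second batch interval are illustrative assumptions, not part of the original question:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# local[2]: a socket receiver occupies one core, so local mode needs at least two
sc = SparkContext("local[2]", "DistinctIDs")
ssc = StreamingContext(sc, 10)  # assumed 10-second micro batches

# Assumed source: one ID per line over a TCP socket
streamIDs = ssc.socketTextStream("localhost", 9999)

# foreachRDD runs the function once per micro batch; each batch is a plain RDD
streamIDs.foreachRDD(lambda rdd: print("number of unique IDs from stream %d" % rdd.distinct().count()))

ssc.start()
ssc.awaitTermination()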
Remember you can use window to create a transformed DStream whose batches cover a longer time span:
streamIDs.window(60).foreachRDD(lambda rdd: print(rdd.distinct().count()))  # window length in seconds; must be a multiple of the batch interval
Answer 2:
Have you tried using:
yourDStream.transform(lambda rdd: rdd.distinct())
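Note that transform returns another DStream, so by itself this prints nothing; you still need an output operation. One way to wire it up, as a sketch (the count()/pprint() pairing is a suggestion, not part of the original answer):
distinctIDs = yourDStream.transform(lambda rdd: rdd.distinct())
# count() produces a one-element DStream per batch; pprint() prints it on the driver
distinctIDs.count().pprint()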
Source: https://stackoverflow.com/questions/32573962/spark-streaming-dstream-does-not-have-distinct