Question
I am using Apache Spark Streaming to do some real-time processing of my web service's API logs. The source stream is just a series of API calls with return codes, and my Spark app mainly aggregates the raw API call logs, counting how many API calls return each HTTP code.
The batch interval on the source stream is 1 second. Then I do:
val level1 = inputStream.reduceByKey(_ + _)
where inputStream is of type DStream[(String, Int)], and now I have the result DStream level1.
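For context, inputStream is derived from the raw log stream roughly like this (rawLogs is a hypothetical DStream[String]; the log format and the parsing below are simplified, not my exact code):

// Hypothetical sketch: each raw API log line is mapped to an
// (HTTP code, 1) pair so that reduceByKey counts calls per return code.
// The split/last parsing is illustrative only.
val inputStream: DStream[(String, Int)] = rawLogs.map { line =>
  val httpCode = line.split(" ").last // e.g. "GET /users 200" -> "200"
  (httpCode, 1)
}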
Then I do reduceByKeyAndWindow on level1 over 60 seconds by calling:
val level2 = level1.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(60))
Then I want to do further aggregation (say level3) over a longer period (say 3600 seconds) on top of DStream level2 by calling:
val level3 = level2.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(3600), Seconds(3600))
My problem now is: I only get aggregated data on level2, while level3 is empty. My understanding is that level3 should not be empty and should aggregate over the level2 stream.
Of course I can change level3 to aggregate over level1 instead of level2, but I don't understand why aggregating over level2 does not work.
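For reference, that workaround just points level3 at level1 with the same reduce function:

// Workaround: window directly over level1 instead of the
// already-windowed level2.
val level3 = level1.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(3600), Seconds(3600))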
It seems to me that you can only do one layer of reduceByKeyAndWindow on the source stream; any further layer of reduceByKeyAndWindow on top of a previously windowed stream won't work.
Any ideas?
Answer 1:
Yes, I think this is a bug in Spark Streaming: the window operation on an already-windowed stream does not seem to work. I'm also investigating the reason and will keep this updated with any findings.
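In case anyone wants to dig in, below is a minimal, self-contained harness along the lines of what I'm testing with. The queueStream source, the synthetic log line, and the shortened windows (6s/60s instead of 60s/3600s, so the behavior shows up within a couple of minutes) are my own assumptions; the operator chain itself matches the question.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object ChainedWindowRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ChainedWindowRepro")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batches, as in the question

    // Synthetic source: one (endpoint, count) pair per batch via a queue stream.
    val queue = mutable.Queue[RDD[(String, Int)]]()
    val inputStream = ssc.queueStream(queue)

    // Same operator chain as the question, with shortened windows.
    // Note: level3's window (60s) is still a multiple of level2's
    // slide (6s), so the configuration should be valid.
    val level1 = inputStream.reduceByKey(_ + _)
    val level2 = level1.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(6), Seconds(6))
    val level3 = level2.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(60))

    level2.print() // produces output every 6 seconds
    level3.print() // the stream reported as empty in the question

    ssc.start()
    // Drive the source for two minutes, one RDD per batch.
    for (_ <- 1 to 120) {
      queue += ssc.sparkContext.makeRDD(Seq(("GET /status 200", 1)))
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}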
Similar question: Windows of windowed streams not displaying the expected results
Source: https://stackoverflow.com/questions/29961925/spark-streaming-dstream-reducebykeyandwindow-doesnt-work