Question
I need to calculate a moving average from a Kinesis stream of data. The window length and slide interval are inputs, and I need to compute the moving average over each window and plot it.
I understand from the docs how to use reduceByKeyAndWindow to get a rolling sum, and I understand how to get the count per window as well. What I am not clear on is how to combine the two to get the average, or how to define an averaging function to pass to reduceByKeyAndWindow. Any help would be appreciated.
Sample code below:
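from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# appName, streamName, endpointUrl and regionName are defined elsewhere.
def createContext():
    sc = SparkContext(appName="PythonSparkStreaming")
    sc.setLogLevel("ERROR")
    ssc = StreamingContext(sc, 5)  # 5-second batches
    # countByWindow and the inverse-function form of reduceByKeyAndWindow
    # both require checkpointing to be enabled
    ssc.checkpoint('/tmp/checkpoint_v06')
    # Define Kinesis consumer
    kinesisStream = KinesisUtils.createStream(ssc,
                                              appName,
                                              streamName,
                                              endpointUrl,
                                              regionName,
                                              InitialPositionInStream.LATEST,
                                              10)
    # Count number of tweets in a batch
    count_this_batch = kinesisStream.count().map(lambda x: ('Count this batch: %s' % x))
    # Count by windowed time period (60-second window, sliding every 5 seconds)
    count_windowed = kinesisStream.countByWindow(60, 5).map(lambda x: ('Counts total (One minute rolling count): %s' % x))
    # Rolling sum per key; this needs a DStream of (key, numeric value) pairs
    sum_window = kinesisStream.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 60, 5)
    return ssc

ssc = StreamingContext.getOrCreate('/tmp/checkpoint_v06', createContext)
ssc.start()
ssc.awaitTermination()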
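One possible way to get an average out of reduceByKeyAndWindow, sketched here rather than taken from the docs, is to carry the sum and the count together: map each value to a (value, 1) pair, add the pairs as data enters the window, subtract them as data leaves, and divide at the end. The helper names below (add_pairs, sub_pairs, to_average, moving_average) are hypothetical, and the sketch assumes the raw records have already been parsed into (key, numeric value) tuples.

```python
# Sketch: moving average per key via (sum, count) pairs.
# All function names here are hypothetical helpers, not part of the Spark API.

def add_pairs(a, b):
    """Combine two (sum, count) pairs; applied to data entering the window."""
    return (a[0] + b[0], a[1] + b[1])


def sub_pairs(a, b):
    """Inverse of add_pairs; applied to data leaving the window."""
    return (a[0] - b[0], a[1] - b[1])


def to_average(pair):
    """Turn a (sum, count) pair into a mean; count must be non-zero."""
    total, count = pair
    return total / count


def moving_average(pairs, window_seconds, slide_seconds):
    """pairs: a DStream of (key, numeric value) tuples (assumed already parsed).

    Returns a DStream of (key, mean over the window). Checkpointing must be
    enabled on the StreamingContext because an inverse function is used.
    """
    return (pairs
            .mapValues(lambda v: (v, 1))
            .reduceByKeyAndWindow(add_pairs, sub_pairs,
                                  window_seconds, slide_seconds)
            # Keys whose records have all left the window end up with count 0
            .filter(lambda kv: kv[1][1] > 0)
            .mapValues(to_average))
```

Inside createContext this might be used as moving_average(parsed, 60, 5).pprint(), where parsed is something like kinesisStream.map(lambda line: ('all', float(line))), assuming each record is a single number. With floating-point values, repeated subtraction in the inverse function can accumulate small rounding errors over long runs; dropping sub_pairs (and recomputing each window from scratch) avoids that at a higher cost per slide.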
Source: https://stackoverflow.com/questions/51838194/spark-streaming-reducebykeyandwindow-for-moving-average-calculation