How to Combine Two DStreams Using PySpark (similar to .zip on a normal RDD)


Question


I know that we can combine two RDDs in PySpark (like cbind in R) as below:

rdd3 = rdd1.zip(rdd2)

I want to do the same for two DStreams in PySpark. Is it possible, or is there an alternative?

In fact, I am using an MLlib random forest model to make predictions with Spark Streaming. In the end, I want to combine the feature DStream and the prediction DStream for further downstream processing.
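For reference, here is a minimal sketch (not from the original post) of one direct alternative: PySpark's DStream.transformWith applies a function to the paired per-batch RDDs of two streams, so RDD.zip can be used batch by batch. The socket sources and port numbers below are assumptions; the approach requires both streams to share a StreamingContext and their per-batch RDDs to be zip-compatible (same number of partitions and same number of elements per partition, as RDD.zip requires).

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="ZipDStreams")
ssc = StreamingContext(sc, 5)  # 5-second batches

# Hypothetical sources; any two DStreams with the same batch interval work.
stream1 = ssc.socketTextStream("localhost", 9999)
stream2 = ssc.socketTextStream("localhost", 9998)

# Pair the streams batch by batch with RDD.zip inside transformWith.
zipped = stream1.transformWith(lambda rdd_a, rdd_b: rdd_a.zip(rdd_b), stream2)
zipped.pprint()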

Thanks in advance.

-Obaid


Answer 1:


In the end, I used the approach below.

The trick is to combine a native Python map with Spark Streaming's transform. It may not be an elegant way, but it works :).

def predictScore(texts, modelRF):
    # texts: a DStream of raw text lines; modelRF: a trained MLlib random
    # forest model. getFeatures(txt) is assumed to return a comma-separated
    # string of numeric features for one line of text.
    featurized = texts.map(lambda txt: (txt, getFeatures(txt))) \
        .map(lambda pair: (pair[0], [float(v) for v in pair[1].split(',')]))
    # MLlib's model.predict cannot be called from inside a worker-side
    # closure, so for each batch we predict on the whole features RDD,
    # collect the scores and texts on the driver, and zip them back together.
    predictions = featurized.transform(lambda rdd: sc.parallelize(
        list(zip(modelRF.predict(rdd.map(lambda p: p[1])).collect(),
                 rdd.map(lambda p: p[0]).collect()))
    ))
    # Each element of the returned DStream is a (score, original text) tuple.
    return predictions
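For context, here is a hypothetical end-to-end wiring of predictScore. The model path, socket source, and batch interval are assumptions for illustration, not part of the original answer.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.tree import RandomForestModel

sc = SparkContext(appName="StreamingRFScoring")
ssc = StreamingContext(sc, 5)

modelRF = RandomForestModel.load(sc, "path/to/model")  # assumed: previously trained and saved model
texts = ssc.socketTextStream("localhost", 9999)        # assumed source: one text record per line

scored = predictScore(texts, modelRF)
scored.pprint()  # prints (score, original text) tuples for each batch

ssc.start()
ssc.awaitTermination()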

I hope this helps somebody facing the same problem. If anybody has a better idea, please post it here.

-Obaid

Note: I also submitted the problem to the Spark user list and posted my answer there as well.



Source: https://stackoverflow.com/questions/37466361/how-to-combine-two-dstreams-using-pyspark-similar-to-zip-on-normal-rdd
