How to process eventhub stream with pyspark and custom python function

Submitted by て烟熏妆下的殇ゞ on 2019-12-01 06:15:08

Question


My current setup is:

  • Spark 2.3.0 with pyspark 2.2.1
  • streaming service using Azure IOTHub/EventHub
  • some custom Python functions based on pandas, matplotlib, etc.

I'm using https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/PySpark/structured-streaming-pyspark-jupyter.md as an example of how to read the data, but:

  • I can't use the foreach sink, as it is not implemented in Python
  • when I try to call .rdd, .map or .flatMap I get an exception: "Queries with streaming sources must be executed with writeStream.start()"

What is the correct way to get each element of the stream and pass it through a python function?

Thanks,

Ed


Answer 1:


First, define a DataFrame that reads the data as a stream from your EventHub or IoT Hub:

from pyspark.sql.functions import *

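# ehConf is the dict of connector options (it carries, among others,
# the "eventhubs.connectionString" entry).
df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()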

The payload is stored as binary in the body column. To get at the elements of the body you have to define its structure:

from pyspark.sql.types import *

Schema = StructType([StructField("name", StringType(), True),
                      StructField("dt", LongType(), True),
                      StructField("main", StructType( 
                          [StructField("temp", DoubleType()), 
                           StructField("pressure", DoubleType())])),
                      StructField("coord", StructType( 
                          [StructField("lon", DoubleType()), 
                           StructField("lat", DoubleType())]))
                    ])

and apply the schema to the body cast to a string:

from pyspark.sql.functions import from_json

# Cast the binary body to a string, parse it as JSON, then flatten the top level.
rawData = df \
  .selectExpr("cast(body as string) as json") \
  .select(from_json("json", Schema).alias("data")) \
  .select("data.*")

On the resulting DataFrame you can apply functions, e.g. call the custom function u_make_hash (which has to be a UDF, or otherwise operate on Column objects) on the column 'name':

parsedData = rawData.select('name', u_make_hash(rawData['name']).alias("namehash"))
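
The answer does not show how u_make_hash is defined or how the query is started, so the following is only a sketch under assumptions: make_hash is a hypothetical SHA-256 helper standing in for the real custom function, and the console sink is chosen purely for illustration. It wraps the plain Python function as a UDF and attaches a sink, which is what the "must be executed with writeStream.start()" error asks for.

```python
import hashlib


def make_hash(name):
    """Hypothetical custom function: SHA-256 hex digest of a name (None stays None)."""
    if name is None:
        return None
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def start_hashed_stream(raw_data):
    """Apply make_hash as a UDF to a streaming DataFrame and start the query."""
    # Imported here so make_hash can be tested without pyspark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    u_make_hash = udf(make_hash, StringType())
    parsed = raw_data.select(
        "name", u_make_hash(raw_data["name"]).alias("namehash")
    )
    # A streaming DataFrame only runs once a sink is attached and started.
    return (
        parsed.writeStream
        .outputMode("append")
        .format("console")  # assumption: console sink, for illustration only
        .start()
    )
```

Calling start_hashed_stream(rawData) returns a StreamingQuery; awaitTermination() on it keeps the job alive. If upgrading to Spark 2.4 or later is an option, DataStreamWriter.foreachBatch hands each micro-batch to a Python function as a regular DataFrame, which may suit pandas-based code better than a per-row UDF.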


Source: https://stackoverflow.com/questions/49365852/how-to-process-eventhub-stream-with-pyspark-and-custom-python-function
