Reading schema of streaming DataFrame in Spark Structured Streaming [duplicate]

Submitted by 折月煮酒 on 2021-02-04 21:01:35

Question


I'm new to Apache Spark Structured Streaming. I'm trying to read some events from Event Hub (in XML format) and create a new Spark DataFrame from the nested XML.

I'm using the code example described at https://github.com/databricks/spark-xml; it runs perfectly in batch mode, but not in Spark Structured Streaming.

Code chunk from the spark-xml GitHub library

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._
val df = ... // DataFrame with XML in column 'payload'
val payloadSchema = schema_of_xml(df.select("payload").as[String])
val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))

My batch code

val df = Seq(
  (8, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>7</tag1> <tag2>4</tag2> <mode>0</mode> <Quantity>1</Quantity></AccountSetup>"),
  (64, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>6</tag1> <tag2>4</tag2>  <mode>0</mode> <Quantity>1</Quantity></AccountSetup>"),
  (27, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>4</tag1> <tag2>4</tag2> <mode>3</mode> <Quantity>1</Quantity></AccountSetup>")
).toDF("number", "body")


val payloadSchema = schema_of_xml(df.select("body").as[String])
val parsed = df.withColumn("parsed", from_xml($"body", payloadSchema))

val final_df = parsed.select(parsed.col("parsed"))
display(final_df.select("parsed.*"))

I tried to apply the same logic to Spark Structured Streaming, as in the following code:

Structured Streaming code

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._


val streamingInputDF = 
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()

val payloadSchema = schema_of_xml(streamingInputDF.select("body").as[String])
val parsed = streamingInputDF.withColumn("parsed", from_xml($"body", payloadSchema))
val final_df = parsed.select(parsed.col("parsed"))

display(final_df.select("parsed.*"))

In the line val payloadSchema = schema_of_xml(streamingInputDF.select("body").as[String]), the schema_of_xml call throws the error Queries with streaming sources must be executed with writeStream.start();;
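(For context: schema_of_xml has to evaluate the input rows to infer a schema, which Spark does not allow on a streaming source. A minimal workaround sketch, assuming the XML layout is known ahead of time so the schema can be inferred once from a static sample; the sample literal here is illustrative, not from the original post:)

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._

// Infer the schema once from a static, in-memory sample of the payload.
// This is an ordinary batch action, so it is permitted.
val sampleDF = Seq(
  "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>7</tag1><tag2>4</tag2><mode>0</mode><Quantity>1</Quantity></AccountSetup>"
).toDF("body")

val payloadSchema = schema_of_xml(sampleDF.select("body").as[String])

// The pre-computed schema can then be applied to the streaming DataFrame.
val parsed = streamingInputDF.withColumn("parsed", from_xml($"body", payloadSchema))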

Update

I tried the following:


val streamingInputDF = 
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()
    .select(($"body").cast("string"))

val body_value = streamingInputDF.select("body").as[String]
body_value.writeStream
    .format("console")
    .start()

spark.streams.awaitAnyTermination()


val payloadSchema = schema_of_xml(body_value)
val parsed = body_value.withColumn("parsed", from_xml($"body", payloadSchema))
val final_df = parsed.select(parsed.col("parsed"))

Now it does not run into the error, but Databricks stays in a "Waiting" state.

Thanks!!


Answer 1:


There is nothing wrong with your code if it works in batch mode.

It is important not only to convert the source into a stream (using readStream and load) but also to convert the sink into a stream.

The error message you are getting is just reminding you to also look at the sink part. Your DataFrame final_df is actually a streaming DataFrame, which has to be started with start().

The Structured Streaming Guide gives you a good overview of all available Output Sinks; the easiest is to print the result to the console.
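As an aside, on Databricks a memory sink is another convenient option for inspecting a stream interactively with SQL; a minimal sketch (the query name parsed_events is an illustrative assumption):

// Sketch: write the stream to an in-memory table and query it with plain SQL.
final_df.writeStream
    .format("memory")
    .queryName("parsed_events")
    .start()

spark.sql("SELECT * FROM parsed_events").show()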

To summarize, you need to add the following to your program:

final_df.writeStream
    .format("console")
    .start()

spark.streams.awaitAnyTermination()
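Note that spark.streams.awaitAnyTermination() blocks the calling thread until one of the active queries terminates, so any code placed after it in the same cell never runs while the stream is active. That is also why the notebook in the update above appears stuck in the "Waiting" state: execution never gets past awaitAnyTermination().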


Source: https://stackoverflow.com/questions/65832782/reading-schema-of-streaming-dataframe-in-spark-structured-streaming
