Question
I'm new to Apache Spark Structured Streaming. I'm trying to read some events from Event Hubs (in XML format) and to create a new Spark DataFrame from the nested XML.
I'm using the code example described in https://github.com/databricks/spark-xml; it runs perfectly in batch mode, but not in Spark Structured Streaming.
Code snippet from the spark-xml GitHub library
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._
val df = ... /// DataFrame with XML in column 'payload'
val payloadSchema = schema_of_xml(df.select("payload").as[String])
val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))
My batch code
val df = Seq(
(8, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>7</tag1> <tag2>4</tag2> <mode>0</mode> <Quantity>1</Quantity></AccountSetup>"),
(64, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>6</tag1> <tag2>4</tag2> <mode>0</mode> <Quantity>1</Quantity></AccountSetup>"),
(27, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>4</tag1> <tag2>4</tag2> <mode>3</mode> <Quantity>1</Quantity></AccountSetup>")
).toDF("number", "body")
val payloadSchema = schema_of_xml(df.select("body").as[String])
val parsed = df.withColumn("parsed", from_xml($"body", payloadSchema))
val final_df = parsed.select(parsed.col("parsed"))
display(final_df.select("parsed.*"))
I tried to apply the same logic in Spark Structured Streaming with the following code:
Structured Streaming code
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val streamingInputDF =
spark.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load()
val payloadSchema = schema_of_xml(streamingInputDF.select("body").as[String])
val parsed = streamingInputDF.withColumn("parsed", from_xml($"body", payloadSchema))
val final_df = parsed.select(parsed.col("parsed"))
display(final_df.select("parsed.*"))
The line val payloadSchema = schema_of_xml(streamingInputDF.select("body").as[String]) throws the error Queries with streaming sources must be executed with writeStream.start();;
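For context, schema_of_xml has to execute an action over its input Dataset, which is not allowed on a streaming source. A common workaround, sketched below under the assumption that a representative sample XML string is available, is to infer the schema from a static sample instead of from the streaming DataFrame:
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._
// Hypothetical representative payload, used only for schema inference
val samplePayload = Seq(
  "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>7</tag1><tag2>4</tag2><mode>0</mode><Quantity>1</Quantity></AccountSetup>"
).toDS()
// schema_of_xml runs on a batch Dataset here, so no streaming action is triggered
val payloadSchema = schema_of_xml(samplePayload)
// The inferred schema can then be applied to the streaming DataFrame
val parsed = streamingInputDF.withColumn("parsed", from_xml($"body".cast("string"), payloadSchema))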
Update
I tried the following:
val streamingInputDF =
spark.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load()
.select(($"body").cast("string"))
val body_value = streamingInputDF.select("body").as[String]
body_value.writeStream
.format("console")
.start()
spark.streams.awaitAnyTermination()
val payloadSchema = schema_of_xml(body_value)
val parsed = body_value.withColumn("parsed", from_xml($"body", payloadSchema))
val final_df = parsed.select(parsed.col("parsed"))
Now it does not run into the error, but Databricks stays in a "Waiting" status.
Thanks!!
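A likely explanation for the "Waiting" status: spark.streams.awaitAnyTermination() blocks the driver until a streaming query terminates, so the statements placed after it (schema_of_xml, from_xml) never execute and the console query simply runs forever. Reordering alone would not help either, because schema_of_xml still cannot run against a streaming Dataset; with the sample-based schema inference sketched above, the blocking call can go last:
// Assumes payloadSchema was inferred from the hypothetical static sample above,
// and that streamingInputDF already has body cast to string
val parsed = streamingInputDF.withColumn("parsed", from_xml($"body", payloadSchema))
// Start the sink first, and block only as the very last statement
parsed.writeStream
  .format("console")
  .start()
spark.streams.awaitAnyTermination()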
Answer 1:
There is nothing wrong with your code if it works in batch mode.
It is important not only to convert the source into a stream (using readStream and load) but also to convert the sink into a stream. The error message you are getting is just reminding you to look at the sink part as well: your DataFrame final_df is actually a streaming DataFrame, which has to be started through start.
The Structured Streaming Guide gives a good overview of all available output sinks; the easiest is to print the result to the console.
To summarize, you need to add the following to your program:
final_df.writeStream
.format("console")
.start()
spark.streams.awaitAnyTermination()
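One usage note: in Databricks notebooks, calling display() on a streaming DataFrame (as in the question) should also start a streaming query under the hood and render live results, so the explicit writeStream ... start() above is mainly needed outside that notebook convenience. In either case, spark.streams.awaitAnyTermination() must remain the last statement so it does not block the rest of the program.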
Source: https://stackoverflow.com/questions/65832782/reading-schema-of-streaming-dataframe-in-spark-structured-streaming