How to convert Spark Streaming data into Spark DataFrame

Submitted by 两盒软妹~ on 2019-12-21 01:26:13

Question


So far, Spark doesn't provide a DataFrame for streaming data, but for my anomaly detection it is more convenient and faster to do the analysis with DataFrames. I have finished that part on batch data, but when I try to do real-time anomaly detection on streaming data, the problems appear. I have tried several approaches and still cannot convert the DStream into a DataFrame, nor convert the RDDs inside the DStream into DataFrames.

Here's part of my latest version of the code:

import sys
import re

from pyspark import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import KMeans, KMeansModel, StreamingKMeans
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import operator


sc = SparkContext(appName="test")
ssc = StreamingContext(sc, 5)
sqlContext = SQLContext(sc)

model_inputs = sys.argv[1]

def streamrdd_to_df(srdd):
    sdf = sqlContext.createDataFrame(srdd)
    sdf.show(n=2, truncate=False)
    return sdf

def main():
    indata = ssc.socketTextStream(sys.argv[2], int(sys.argv[3]))
    inrdd = indata.map(lambda r: get_tuple(r))
    Features = Row('rawFeatures')
    features_rdd = inrdd.map(lambda r: Features(r))
    features_rdd.pprint(num=3)
    streaming_df = features_rdd.flatMap(streamrdd_to_df)

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()

As you can see in the main() function, reading the input streaming data with ssc.socketTextStream() produces a DStream, and I then try to convert each individual record in the DStream into a Row, hoping to turn the data into a DataFrame later.

If I use pprint() to print out features_rdd here, it works, which makes me think each element of features_rdd is a batch RDD while features_rdd as a whole is a DStream.

Then I created the streamrdd_to_df() method, hoping to convert each batch RDD into a DataFrame, but it gives me this error:

ERROR StreamingContext: Error starting the context, marking it as stopped java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute

Any thoughts on how I can do DataFrame operations on Spark streaming data?


Answer 1:


Spark provides Structured Streaming, which solves exactly this kind of problem. It produces a streaming DataFrame, i.e. a DataFrame that is continuously appended to. Please check the programming guide linked below; a minimal sketch follows it.

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
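For reference, a minimal Structured Streaming sketch in PySpark (not taken from the answer; the host, port and console sink are placeholder assumptions chosen to mirror the question's socket source):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

# lines is an unbounded (streaming) DataFrame that grows as records arrive
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Normal DataFrame operations can be applied before writing to a sink
query = lines.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()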




Answer 2:


Read the error carefully. It says there are no output operations registered. Spark is lazy and only executes a job when it has something to produce as a result. Your program has no "output operation", and that is exactly what Spark is complaining about.

Define a foreach() (or a raw SQL query) over the DataFrame and then print the results, and it will work fine. A sketch of what that could look like is below.
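Here is a hedged sketch of one way to register an output operation on the DStream from the question; foreachRDD is an output operation, so the "No output operations registered" error goes away. The names sqlContext and features_rdd come from the question, and the empty-batch check is an assumption added to keep createDataFrame from failing on an empty RDD:

def process_batch(rdd):
    # Skip empty micro-batches; createDataFrame cannot infer a schema from them
    if not rdd.isEmpty():
        sdf = sqlContext.createDataFrame(rdd)   # rdd already contains Row objects
        sdf.show(n=2, truncate=False)           # or register a temp table and run SQL

# foreachRDD registers an output operation; call this before ssc.start()
features_rdd.foreachRDD(process_batch)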




Answer 3:


Why don't you use something like this:

def socket_streamer(session):  # returns a streaming DataFrame
    streamer = session.readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()
    return streamer

The output of the function above (and of readStream in general) is already a DataFrame, so you don't need to create one yourself; Spark does that automatically. See the Spark Structured Streaming Programming Guide. A hypothetical usage of this function is sketched below.
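A hypothetical usage sketch (the session variable, the groupBy on the socket source's default "value" column, and the console sink are illustrative assumptions, not part of the original answer):

df = socket_streamer(session)                 # session is an existing SparkSession
counts = df.groupBy("value").count()          # "value" is the socket source's column
counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()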




Answer 4:


After a year, I started to explore the Spark 2.0 streaming methods and finally solved my anomaly detection problem. Here's my code in IPython, where you can also see what my raw data input looks like.




Answer 5:


There is no need to convert the DStream into an RDD yourself. By definition a DStream is a collection of RDDs. Just use the DStream's foreachRDD() method to loop over each RDD and take action.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("Sample")
val spark = SparkSession.builder.config(conf).getOrCreate()

// sampleStream is an existing DStream of JSON strings
sampleStream.foreachRDD(rdd => {
  val sampleDataFrame = spark.read.json(rdd)
  sampleDataFrame.show()  // any DataFrame operation goes here
})



Answer 6:


With Spark 2.3 / Python 3 / Scala 2.11 (using Databricks) I was able to use temporary tables and a Scala code snippet (via the %scala magic in notebooks):

Python Part:

ddf.createOrReplaceTempView("TempItems")

Then on a new cell:

%scala
import java.sql.DriverManager
import org.apache.spark.sql.{ForeachWriter, Row}

// Create the query to be persisted...
val tempItemsDF = spark.sql("SELECT field1, field2, field3 FROM TempItems")

val itemsQuery = tempItemsDF.writeStream.foreach(new ForeachWriter[Row] 
{      
  def open(partitionId: Long, version: Long): Boolean = {
    // Initializing DB connection / etc...
    true  // return true so the partition is processed
  }

  def process(value: Row): Unit = {
    val field1 = value(0)
    val field2 = value(1)
    val field3 = value(2)

    // Processing values ...
  }

  def close(errorOrNull: Throwable): Unit = {
    // Closing connections etc...
  }
})

val streamingQuery = itemsQuery.start()


Source: https://stackoverflow.com/questions/35245648/how-to-convert-spark-streaming-data-into-spark-dataframe
