How to use foreachRDD in legacy Spark Streaming

问题

I am getting exception while using foreachRDD for my CSV data processing. Here is my code

  case class Person(name: String, age: Long)
  val conf = new SparkConf()
  conf.setMaster("local[*]")
  conf.setAppName("CassandraExample").set("spark.driver.allowMultipleContexts", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  val smDstream=ssc.textFileStream("file:///home/sa/testFiles")

  smDstream.foreachRDD((rdd,time) => {
  val peopleDF = rdd.map(_.split(",")).map(attributes => 
  Person(attributes(0), attributes(1).trim.toInt)).toDF()
  peopleDF.createOrReplaceTempView("people")
  val teenagersDF = spark.sql("insert into table devDB.stam SELECT name, age 
  FROM people WHERE age BETWEEN 13 AND 29")
  //teenagersDF.show  
    })
  ssc.checkpoint("hdfs://go/hive/warehouse/devDB.db")
  ssc.start()

i am getting following error java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable org.apache.spark.streaming.StreamingContext Serialization stack: - object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@1263422a) - field (class: $iw, name: ssc, type: class org.apache.spark.streaming.StreamingContext)

please help

回答1:

The question does not really make sense anymore in that dStreams are being deprecated / abandoned.

There a few things to consider in the code, what the exact question is therefore hard to glean. That said, I had to ponder as well as I am not a Serialization expert.

You can find a few posts of some trying to write to Hive table directly as opposed to a path, in my answer I use an approach but you can use your approach of Spark SQL to write for a TempView, that is all possible.

I simulated input from a QueueStream, so I need no split to be applied. You can adapt this to your own situation if you follow the same "global" approach. I elected to write to a parquet file that gets created if needed. You can create your tempView and then use spark.sql as per your initial approach.

The Output Operations on DStreams are:

print()

saveAsTextFiles(prefix, [suffix])

saveAsObjectFiles(prefix, [suffix])

saveAsHadoopFiles(prefix, [suffix])

foreachRDD(func)

foreachRDD

The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

It states saving to files, but it can do what you want via foreachRDD, albeit I assumed the idea was to external systems. Saving to files is quicker in my view as opposed to going through steps to write a table directly. You want to offload data asap with Streaming as volumes are typically high.

Two steps:

In a separate class to the Streaming Class - run under Spark 2.4:

case class Person(name: String, age: Int)

Then the Streaming logic you need to apply - you may need some imports that I have in my notebook otherwise as I ran this under DataBricks:

import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
import org.apache.spark.sql.SaveMode

val spark = SparkSession
           .builder
           .master("local[4]")
           .config("spark.driver.cores", 2)
           .appName("forEachRDD")
           .getOrCreate()

val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1)) 

val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue) 

QS.foreachRDD(q => {
   if(!q.isEmpty) {   
      val q_flatMap = q.flatMap{x=>x}
      val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
      val df = q_withPerson.toDF()      

      df.write
        .format("parquet")
        .mode(SaveMode.Append)
        .saveAsTable("SO_Quest_BigD")
   }
 }
)

ssc.start()
for (c <- List(List(("Fred",53), ("John",22), ("Mary",76)), List(("Bob",54), ("Johnny",92), ("Margaret",15)), List(("Alfred",21), ("Patsy",34), ("Sylvester",7)) )) {
   rddQueue += ssc.sparkContext.parallelize(List(c))
} 
ssc.awaitTermination()

来源：https://stackoverflow.com/questions/54021043/how-to-use-foreachrdd-in-legacy-spark-streaming

标签

apache-spark

spark-streaming