Question
I am trying to filter streaming data, and based on the value of the id column I want to save each record to a different table.
I have two tables:
- testTable_odd (id, data1, data2)
- testTable_even (id, data1)
If the id value is odd, I want to save the record to the testTable_odd table, and if the value is even, I want to save it to testTable_even.
The tricky part here is that my two tables have different columns. I have tried multiple approaches, and considered Scala functions with return type Either[obj1, obj2], but I wasn't able to succeed. Any pointers would be greatly appreciated.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import com.datastax.spark.connector.SomeColumns
import kafka.serializer.StringDecoder
import org.json4s._
import org.json4s.native.JsonMethods._

object StreamProcessor extends Serializable {

  // One record parsed from a Kafka JSON message.
  case class wordCount(id: Long, data1: String, data2: String)

  implicit val formats = DefaultFormats

  // Parse a raw JSON message into a wordCount record.
  def msgParseMaster(msg: String): wordCount = parse(msg).extract[wordCount]

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamProcessor")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val sqlContext = new SQLContext(sc)
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = args.toSet
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream
      .map { case (_, msg) =>
        val result = msgParseMaster(msg)
        (result.id, result.data1)
      }
      .foreachRDD(rdd => if (!rdd.isEmpty) rdd.saveToCassandra("testKS", "testTable", SomeColumns("id", "data1")))

    ssc.start()
    ssc.awaitTermination()
  }
}
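The Either-based split I was considering looks roughly like this (OddRecord, EvenRecord, and classify are just illustrative names for the two table shapes):

// Hypothetical sketch: one case class per target table, and a classifier
// that routes a parsed record by the parity of its id.
case class OddRecord(id: Long, data1: String, data2: String)
case class EvenRecord(id: Long, data1: String)

def classify(w: wordCount): Either[OddRecord, EvenRecord] =
  if (w.id % 2 != 0) Left(OddRecord(w.id, w.data1, w.data2))
  else Right(EvenRecord(w.id, w.data1))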
Answer 1:
I think you just want to use the filter function twice. You can do something like this:

val evenstream = stream.map { case (_, msg) =>
  val result = msgParseMaster(msg)
  (result.id, result.data1)
}.filter { case (id, _) => id % 2 == 0 }

evenstream.foreachRDD { rdd =>
  // do something with the even stream, e.g. save to testTable_even
}

val oddstream = stream.map { case (_, msg) =>
  val result = msgParseMaster(msg)
  (result.id, result.data1)
}.filter { case (id, _) => id % 2 == 1 }

oddstream.foreachRDD { rdd =>
  // do something with the odd stream, e.g. save to testTable_odd
}
When I did something similar on a project, I used the filter function twice in the same way; if you look near line 191 there, I was classifying and saving tuples based on whether their value fell between 0 and 1, so feel free to check that out.
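To write each stream to its own table with its own columns, you can then map each filtered stream down to exactly the columns its table needs before saving. A minimal sketch, reusing msgParseMaster and the connector import from the question (parity checks and column lists follow the question's schema):

val parsed = stream.map { case (_, msg) => msgParseMaster(msg) }

// Odd ids keep all three columns for testTable_odd.
parsed.filter(_.id % 2 != 0)
  .map(w => (w.id, w.data1, w.data2))
  .foreachRDD(rdd => if (!rdd.isEmpty)
    rdd.saveToCassandra("testKS", "testTable_odd", SomeColumns("id", "data1", "data2")))

// Even ids keep only (id, data1) for testTable_even.
parsed.filter(_.id % 2 == 0)
  .map(w => (w.id, w.data1))
  .foreachRDD(rdd => if (!rdd.isEmpty)
    rdd.saveToCassandra("testKS", "testTable_even", SomeColumns("id", "data1")))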
Answer 2:
I performed the steps below:
1) Extracted the details from the raw JSON string into a case class.
2) Created a "super" JSON (one that carries the details required by both filter criteria).
3) Converted that JSON into a DataFrame.
4) Performed the select and where clauses on that DataFrame.
5) Saved the results to Cassandra.
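A minimal sketch of those steps, assuming the stream, msgParseMaster, and sqlContext from the question and the Cassandra connector's DataFrame source (keyspace and table names are taken from the question):

stream.map { case (_, msg) => msgParseMaster(msg) }.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // Steps 2)-3): RDD of the "super" case class -> DataFrame.
    val df = sqlContext.createDataFrame(rdd)
    // Steps 4)-5): select/where per target schema, then append to Cassandra.
    df.filter("id % 2 != 0").select("id", "data1", "data2")
      .write.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testKS", "table" -> "testTable_odd"))
      .mode(SaveMode.Append).save()
    df.filter("id % 2 = 0").select("id", "data1")
      .write.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testKS", "table" -> "testTable_even"))
      .mode(SaveMode.Append).save()
  }
}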
Source: https://stackoverflow.com/questions/38811434/spark-streaming-filtering-the-streaming-data