Spark Streaming: Filtering the Streaming Data


Question


I am trying to filter the streaming data, and based on the value of the id column I want to save the data to different tables.

I have two tables:

  1. testTable_odd (id,data1,data2)
  2. testTable_even (id,data1)

If the id value is odd, I want to save the record to the testTable_odd table, and if it is even, to testTable_even.

The tricky part is that my two tables have different columns. I have tried multiple ways, including Scala functions with return type Either[obj1,obj2], but I wasn't able to succeed. Any pointers would be greatly appreciated. My code so far:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._

import kafka.serializer.StringDecoder

object StreamProcessor extends Serializable {

  import org.json4s._
  import org.json4s.native.JsonMethods._

  case class wordCount(id: Long, data1: String, data2: String) extends Serializable

  implicit val formats = DefaultFormats

  // Parse one Kafka message (a JSON string) into a wordCount record.
  def msgParseMaster(msg: String): wordCount =
    parse(msg).extract[wordCount]

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamProcessor")
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val sqlContext = new SQLContext(sc)

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = args.toSet

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Currently every record goes to a single table; this is what I want to split.
    stream
      .map { case (_, msg) =>
        val result = msgParseMaster(msg)
        (result.id, result.data1)
      }
      .foreachRDD { rdd =>
        if (!rdd.isEmpty)
          rdd.saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))
      }

    ssc.start()
    ssc.awaitTermination()
  }
}

Answer 1:


I think you just want to use the filter function twice. You can do something like:

val evenstream = stream.map { case (_, msg) =>
  val result = msgParseMaster(msg)
  (result.id, result.data1)
}.filter { k =>
  k._1 % 2 == 0
}

evenstream.foreachRDD { rdd =>
  // do something with the even stream
}

val oddstream = stream.map { case (_, msg) =>
  val result = msgParseMaster(msg)
  (result.id, result.data1)
}.filter { k =>
  k._1 % 2 == 1
}

oddstream.foreachRDD { rdd =>
  // do something with the odd stream
}
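
From there, each filtered stream can be written out inside foreachRDD. Here is a minimal sketch, reusing msgParseMaster, the wordCount case class, and the testKS keyspace from the question, and assuming both target tables already exist in Cassandra:

// Parse once, then filter the same DStream twice and write each half
// to the table whose columns match it (testKS keyspace as in the question).
val parsed = stream.map { case (_, msg) => msgParseMaster(msg) }

parsed.filter(_.id % 2 == 1).foreachRDD { rdd =>
  if (!rdd.isEmpty)
    rdd.map(w => (w.id, w.data1, w.data2))
      .saveToCassandra("testKS", "testTable_odd", SomeColumns("id", "data1", "data2"))
}

parsed.filter(_.id % 2 == 0).foreachRDD { rdd =>
  if (!rdd.isEmpty)
    rdd.map(w => (w.id, w.data1))
      .saveToCassandra("testKS", "testTable_even", SomeColumns("id", "data1"))
}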

When I did something similar on a project here, I also used the filter function twice; look near line 191 of that code. There I was classifying and saving tuples based on whether their value was 0 or 1, so feel free to check that out.




Answer 2:


I performed the steps below (a sketch follows the list):

  1. extracted the details from the raw JSON string with a case class
  2. created a "super" JSON that has the details required for both filter criteria
  3. converted that JSON into a DataFrame
  4. performed select and where clauses on that DataFrame
  5. saved to Cassandra
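
A minimal sketch of those steps under the question's setup (Spark 1.x SQLContext and the spark-cassandra-connector DataFrame writer; the keyspace and table names come from the question, the rest is an assumption):

import org.apache.spark.sql.SaveMode

// Per batch: parse the raw JSON lines into one "super" DataFrame
// (id, data1, data2), then route rows with a filter on id.
stream.map { case (_, msg) => msg }.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val df = sqlContext.read.json(rdd)

    df.filter("id % 2 = 1").select("id", "data1", "data2")
      .write.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testKS", "table" -> "testTable_odd"))
      .mode(SaveMode.Append).save()

    df.filter("id % 2 = 0").select("id", "data1")
      .write.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testKS", "table" -> "testTable_even"))
      .mode(SaveMode.Append).save()
  }
}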



Source: https://stackoverflow.com/questions/38811434/spark-streaming-filtering-the-streaming-data
