Issue while storing data from Spark-Streaming to Cassandra

asked 2020-12-02 02:37

A Spark Streaming context reads a stream from RabbitMQ with an interval of 30 seconds. I want to modify the values of a few columns of the corresponding rows …

2 Answers
  • 2020-12-02 02:53

    Try with x.sparkContext.cassandraTable() instead of ssc.cassandraTable() and see if it helps; a sketch follows below.

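    A minimal sketch of what that suggestion looks like inside foreachRDD (the stream, keyspace and table names here are placeholders, not from the original post):

      import com.datastax.spark.connector._

      stream.foreachRDD { rdd =>
        // cassandraTable() is provided on the SparkContext by the
        // spark-cassandra-connector implicits, not on the StreamingContext (ssc)
        val fromCassandra = rdd.sparkContext.cassandraTable("my_keyspace", "my_table")
        // ... use fromCassandra to look up or modify the desired columns
      }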
  • 2020-12-02 03:16

    The SparkContext cannot be serialized and passed across multiple workers on possibly different nodes. If you need to do something like this you could use foreachPartition or mapPartitions. Otherwise, do the Cassandra work inside the function that gets passed around:

     CassandraConnector(SparkWriter.conf).withSessionDo { session =>
       // ... build and execute the statement for this record
       session.executeAsync(<CQL Statement>)
     }


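    Putting that together with foreachPartition, a rough sketch of the pattern might be (stream, conf, and the CQL statement are placeholders; only CassandraConnector and withSessionDo come from the answer itself):

      import com.datastax.spark.connector.cql.CassandraConnector

      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // CassandraConnector is serializable, so it can be shipped to the
          // workers and open one session per partition
          CassandraConnector(conf).withSessionDo { session =>
            records.foreach { r =>
              session.executeAsync(s"INSERT INTO my_keyspace.my_table (id) VALUES ('$r')")
            }
          }
        }
      }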
    and in the SparkConf you need to provide the Cassandra connection details:

      val conf = new SparkConf()
        .setAppName("test")
        .set("spark.ui.enabled", "true")
        .set("spark.executor.memory", "8g")
        //  .set("spark.executor.core", "4")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "/ephemeral/spark-events")
        // to avoid disk space issues (spark.local.dir defaults to /tmp)
        .set("spark.local.dir", "/ephemeral/spark-scratch")
        .set("spark.cleaner.ttl", "10000")
        .set("spark.cassandra.connection.host", cassandraip)
        .setMaster("spark://10.255.49.238:7077")
    
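    Note that the spark.cassandra.connection.host value set here is what CassandraConnector(conf) in the snippet above uses to find the cluster, so hand the same SparkConf to the connector.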

    The Java CSVParser is a library class that is not serializable, so Spark cannot send it to the (possibly different) nodes if you call map or forEach on the RDD. One workaround is mapPartitions, in which case one full partition is processed on one Spark node, so the parser does not need to be serialized for each call. Example:

    val rdd_inital_parse = rdd.mapPartitions(pLines)

    def pLines(lines: Iterator[String]) = {
      // CSVParser is not serializable; creating it here means it lives only
      // inside the task, whereas capturing it in rdd.map(...) would fail
      val parser = new CSVParser()
      lines.map(x => parseCSVLine(x, parser.parseLine))
    }
    