Issue while storing data from Spark-Streaming to Cassandra

asked 2020-12-02 02:37

A Spark Streaming context reads a stream from RabbitMQ with an interval of 30 seconds. I want to modify the values of a few columns of the corresponding rows …

2 Answers
  • 2020-12-02 02:53

    Try with x.sparkContext.cassandraTable() instead of ssc.cassandraTable() and see if it helps; a sketch follows below.

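    A minimal sketch of what that suggestion looks like inside foreachRDD (the stream, keyspace and table names here are placeholders, not from the original post):

      import com.datastax.spark.connector._

      stream.foreachRDD { rdd =>
        // cassandraTable() is provided on the SparkContext by the
        // spark-cassandra-connector implicits, not on the StreamingContext (ssc)
        val fromCassandra = rdd.sparkContext.cassandraTable("my_keyspace", "my_table")
        // ... use fromCassandra to look up or modify the desired columns
      }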
  • 2020-12-02 03:16

    The SparkContext cannot be serialized and passed across multiple workers on possibly different nodes. If you need to do something like this you could use foreachPartition or mapPartitions. Otherwise, do the Cassandra work inside the function that gets passed around:

     CassandraConnector(SparkWriter.conf).withSessionDo { session =>
       // ... build and execute the statement for this record
       session.executeAsync(<CQL Statement>)
     }


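    Putting that together with foreachPartition, a rough sketch of the pattern might be (stream, conf, and the CQL statement are placeholders; only CassandraConnector and withSessionDo come from the answer itself):

      import com.datastax.spark.connector.cql.CassandraConnector

      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // CassandraConnector is serializable, so it can be shipped to the
          // workers and open one session per partition
          CassandraConnector(conf).withSessionDo { session =>
            records.foreach { r =>
              session.executeAsync(s"INSERT INTO my_keyspace.my_table (id) VALUES ('$r')")
            }
          }
        }
      }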
    and in the SparkConf you need to provide the Cassandra connection details:

      val conf = new SparkConf()
        .setAppName("test")
        .set("spark.ui.enabled", "true")
        .set("spark.executor.memory", "8g")
        //  .set("spark.executor.core", "4")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "/ephemeral/spark-events")
        // to avoid disk space issues (spark.local.dir defaults to /tmp)
        .set("spark.local.dir", "/ephemeral/spark-scratch")
        .set("spark.cleaner.ttl", "10000")
        .set("spark.cassandra.connection.host", cassandraip)
        .setMaster("spark://10.255.49.238:7077")
    
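    Note that the spark.cassandra.connection.host value set here is what CassandraConnector(conf) in the snippet above uses to find the cluster, so hand the same SparkConf to the connector.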

    The Java CSVParser is a library class that is not serializable, so Spark cannot send it to the (possibly different) nodes if you call map or forEach on the RDD. One workaround is mapPartitions, in which case one full partition is processed on one Spark node, so the parser does not need to be serialized for each call. Example:

    val rdd_inital_parse = rdd.mapPartitions(pLines)

    def pLines(lines: Iterator[String]) = {
      // CSVParser is not serializable; creating it here means it lives only
      // inside the task, whereas capturing it in rdd.map(...) would fail
      val parser = new CSVParser()
      lines.map(x => parseCSVLine(x, parser.parseLine))
    }
    