Handle database connection inside spark streaming


Question


I am not sure whether I understand correctly how Spark handles database connections, and how to reliably run a large number of database update operations inside Spark without potentially breaking the Spark job. This is a code snippet I have been using (simplified for illustration):

val driver = new MongoDriver
val hostList: List[String] = conf.getString("mongo.hosts").split(",").toList
val connection = driver.connection(hostList)
val mongodb = connection(conf.getString("mongo.db"))
val dailyInventoryCol = mongodb[BSONCollection](conf.getString("mongo.collections.dailyInventory"))

val stream: InputDStream[(String,String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()));

def processRDD(rdd: RDD[(String, String)]): Unit = {
    val df = rdd.map(line => {
        ...
    }).flatMap(x => x).toDF()

    if (!isEmptyDF(df)) {
        var mongoF: Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]] = Seq();

        val dfF2 = df.groupBy($"CountryCode", $"Width", $"Height", $"RequestType", $"Timestamp").agg(sum($"Frequency")).collect().map(row => {
        val countryCode = row.getString(0); val width = row.getInt(1); val height = row.getInt(2);
        val requestType = row.getInt(3); val timestamp = row.getLong(4); val frequency = row.getLong(5);
        val endTimestamp = timestamp + 24*60*60; //next day

        val updateOp = dailyInventoryCol.updateModifier(BSONDocument("$inc" -> BSONDocument("totalFrequency" -> frequency)), false, true)

        val f: Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult] =
        dailyInventoryCol.findAndModify(BSONDocument("width" -> width, "height" -> height, "country_code" -> countryCode, "request_type" -> requestType,
         "startTs" -> timestamp, "endTs" -> endTimestamp), updateOp) 

        f
        })

        mongoF = mongoF ++ dfF2

        //split into small chunks to avoid exhausting the mongodb connections
        val futureList: List[Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]]] = mongoF.grouped(200).toList

        //execute each chunk and wait for it to complete before starting the next one
        futureList.foreach(seqF => {
            Await.result(Future.sequence(seqF), 40.seconds)
        })
    }
}

stream.foreachRDD(processRDD(_))

Basically, I am using ReactiveMongo (Scala), and for each RDD I convert it into a DataFrame, group/extract the necessary data, and then fire a large number of update queries against Mongo. I want to ask:

  1. I am using Mesos to deploy Spark on 3 servers and have one more server for the Mongo database. Is this the correct way to handle database connections? My concern is whether the database connection / pooling is opened at the beginning of the Spark job and maintained properly (surviving timeouts / network errors) for the whole lifetime of the job (weeks, months, ...), and whether it is closed when each batch finishes. Given that the job might be scheduled on different servers, does that mean each batch will open a different set of DB connections?

  2. What happens if an exception occurs while executing the queries? Will the Spark job for that batch fail, but the next batch keep running?

  3. If there are too many update queries (2000+) to run against the Mongo database and the execution time exceeds the configured Spark batch duration (2 minutes), will that cause a problem? I noticed that with my current setup, after about 2-3 days all of the batches are queued up as "Process" on the Spark WebUI and none of them exits properly (if I disable the Mongo update part, I can run for a week without problems). This effectively hangs all batch jobs until I restart/resubmit the job.

Thanks a lot. I would appreciate any help with this issue.


Answer 1:


Please read "Design Patterns for using foreachRDD" section in http://spark.apache.org/docs/latest/streaming-programming-guide.html. This will clear your doubts about how connections should be used/ created.

Secondly, I would suggest keeping the direct update operations separate from your Spark job. A better approach is to have your Spark job process the data and post the results to a Kafka queue, and then have another dedicated process/job that reads the data from that Kafka queue and performs the insert/update operations on MongoDB.
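Here is a rough sketch of the producer side of that design, assuming the standard org.apache.kafka.clients.producer API, a placeholder broker address, and a hypothetical topic name "inventory-updates"; the aggregated rows are assumed to be serialized to JSON strings upstream (for example with df.toJSON).

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

def publishToKafka(jsonRows: RDD[String]): Unit = {
  jsonRows.foreachPartition { partition =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // one producer per partition, created on the executor
    val producer = new KafkaProducer[String, String](props)
    partition.foreach { json =>
      producer.send(new ProducerRecord[String, String]("inventory-updates", json))
    }
    producer.close()
  }
}

A separate consumer process can then drain that topic and perform the findAndModify calls against Mongo at its own pace, so a slow or failing Mongo write never backs up the streaming batches.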



Source: https://stackoverflow.com/questions/33709769/handle-database-connection-inside-spark-streaming
