Question
I have a Spark Streaming application running which uses the mapWithState function to track the state of an RDD. The application runs fine for a few minutes but then crashes with:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 373
I observed that the memory usage of the Spark application increases linearly over time, even though I have set a timeout for the MapWithStateRDD. Please see the code snippet below:
val completedSess = sessionLines
  .mapWithState(StateSpec.function(trackStateFunction _)
    .numPartitions(80)
    .timeout(Minutes(5)))
Why does the memory increase linearly over time if there is an explicit timeout for each RDD?
I have tried increasing the memory, but it does not help. What am I missing?
Edit - Code for reference
def trackStateFunction(batchTime: Time, key: String, value: Option[String],
                       state: State[(Boolean, List[String], Long)]): Option[(Boolean, List[String])] = {

  def updateSessions(newLine: String): Option[(Boolean, List[String])] = {
    val currentTime = System.currentTimeMillis() / 1000
    if (state.exists()) {
      val newLines = state.get()._2 :+ newLine
      // Check if the end of the session has been reached.
      // If yes, remove the state and return; otherwise update the state.
      if (isEndOfSessionReached(value.getOrElse(""), state.get()._3)) {
        state.remove()
        Some((true, newLines))
      } else {
        val newState = (false, newLines, currentTime)
        state.update(newState)
        Some((state.get()._1, state.get()._2))
      }
    } else {
      val newState = (false, List(value.get), currentTime)
      state.update(newState)
      Some((state.get()._1, state.get()._2))
    }
  }

  value match {
    case Some(newLine) => updateSessions(newLine)
    case _ if state.isTimingOut() => Some((true, state.get()._2))
    case _ =>
      println("Not matched to any expression")
      None
  }
}
Answer 1:
According to the mapWithState documentation, a StateSpec can carry the following settings (see the sketch after this list):
An initial state as RDD - You can load the initial state from some store and then start your streaming job with that state.
Number of partitions - The key-value state DStream is partitioned by keys. If you have a good estimate of the size of the state beforehand, you can provide the number of partitions to partition it accordingly.
Partitioner - You can also provide a custom partitioner. The default is a hash partitioner. If you have a good understanding of the key space, a custom partitioner can do more efficient updates than the default hash partitioner.
Timeout - This ensures that keys whose values have not been updated for a specific period of time are removed from the state. This can help in cleaning up old keys.
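A minimal sketch of how these options combine on a StateSpec, reusing trackStateFunction and the state type from the question; initialRDD is a hypothetical placeholder, and numPartitions/partitioner are alternatives shown together only for illustration:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, StateSpec}

// Hypothetical initial state, e.g. loaded from an external store.
val initialRDD: RDD[(String, (Boolean, List[String], Long))] =
  sc.emptyRDD[(String, (Boolean, List[String], Long))]

val spec = StateSpec
  .function(trackStateFunction _)        // update function from the question
  .initialState(initialRDD)              // start the job with pre-loaded state
  .numPartitions(80)                     // partition the state by key, or...
  .partitioner(new HashPartitioner(80))  // ...supply a custom partitioner
  .timeout(Minutes(5))                   // drop keys not updated for 5 minutes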
So the timeout only takes care of cleaning up keys that stop being updated. The memory still fills up and eventually blocks, because the executors do not have enough memory assigned; that is what produces the MetadataFetchFailedException. By increasing the memory, I hope you mean the memory of the executors. Even then, increasing executor memory probably won't work, since the stream keeps flowing. With mapWithState, the resulting stream contains the same number of records as the input DStream, so the way to solve this is to make your DStream smaller. On the streaming context you can set the batch interval, which will most likely solve this:
val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))
Remember to also take a snapshot and a checkpoint once in a while. The snapshots allow you to use the information from the otherwise lost earlier stream for other calculations. For more information see: https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.html and http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark/
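A minimal sketch of both, assuming the ssc and completedSess names from the question; the checkpoint directory is a placeholder:

// Checkpointing is required for mapWithState; the directory is a placeholder.
ssc.checkpoint("/tmp/spark-checkpoints")

// stateSnapshots() emits the full (key, state) contents after every batch,
// so the state stays available for other calculations or external persistence.
completedSess.stateSnapshots().foreachRDD { rdd =>
  println(s"Active sessions in state: ${rdd.count()}")
}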
Answer 2:
mapWithState also stores the mapped values in RAM (see MapWithStateRDD); by default, mapWithState keeps up to 20 MapWithStateRDDs in RAM.
In short, RAM usage is proportional to the batch interval, so you can try reducing the batch interval to reduce RAM usage.
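A sketch of the corresponding knobs, reusing the sc, ssc and completedSess names from the question; the intervals are illustrative, and the explicit checkpoint interval is an assumption about how to shorten the retained MapWithStateRDD lineage:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// A smaller batch interval means fewer records buffered per MapWithStateRDD.
val ssc = new StreamingContext(sc, Seconds(10))

// Optionally checkpoint the stateful stream more often than the default
// (a multiple of the batch interval), so fewer MapWithStateRDDs are retained.
completedSess.checkpoint(Seconds(20))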
Source: https://stackoverflow.com/questions/42641573/why-does-memory-usage-of-spark-worker-increases-with-time