set spark.streaming.kafka.maxRatePerPartition for createDirectStream

南方客 2020-12-28 23:48

I need to increase the input rate per partition for my application, and I have used .set("spark.streaming.kafka.maxRatePerPartition",100) in the config. The st

2 Answers
  • 2020-12-29 00:19

    The property caps the fetch at N messages per partition per second. With M partitions and a batch interval of B seconds, the most messages you can see in one batch is N * M * B.

    There are a few things you should verify:

    1. Is your input rate greater than 500 messages per 10 s?
    2. Is the Kafka topic properly partitioned?
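
    As a quick check of the N * M * B arithmetic, here is a minimal sketch. The helper name is hypothetical (it is not part of Spark), and the values are assumed from the question: N = 100, M = 5 partitions, B = 10 s.

    ```scala
    // Hypothetical helper, not a Spark API: upper bound on messages in one batch.
    object BatchCap {
      def maxMessagesPerBatch(ratePerPartition: Long, partitions: Int, batchSeconds: Long): Long =
        ratePerPartition * partitions * batchSeconds

      def main(args: Array[String]): Unit = {
        // Assumed values: N = 100 msg/s per partition, M = 5 partitions, B = 10 s.
        println(maxMessagesPerBatch(100L, 5, 10L)) // 100 * 5 * 10 = 5000
      }
    }
    ```

    This is only an upper bound. If the producers write fewer messages than that during the batch interval, the batch will be smaller.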
  • 2020-12-29 00:24

    The stream duration is 10s so I expect process 5*100*10=5000 messages for this batch.

    That's not what the setting means. It means "how many elements each partition can have per batch", not per second. I'm going to assume you have 5 partitions, so you're getting 5 * 100 = 500. If you want 5000, set maxRatePerPartition to 1000.

    From "Exactly-once Spark Streaming From Apache Kafka" (written by Cody, the author of the Direct Stream approach, emphasis mine):

    For rate limiting, you can use the Spark configuration variable spark.streaming.kafka.maxRatePerPartition to set the maximum number of messages per partition per batch.

    Edit:

    After @avrs's comment, I looked at the code that defines the max rate. As it turns out, the heuristic is a bit more complex than stated in both the blog post and the docs.

    There are two branches. If backpressure is enabled alongside maxRate, then the effective rate is the minimum of the current backpressure rate calculated by the RateEstimator object and the maxRate set by the user. If backpressure isn't enabled, the maxRate is used as is.
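
    The two branches can be sketched roughly as follows. This is a simplified illustration, not Spark's actual code: the object and function names are hypothetical, the backpressure estimate is modeled as an Option, and the real implementation may do more per-partition bookkeeping.

    ```scala
    // Simplified sketch of the rate selection; names are hypothetical, not Spark APIs.
    object RateChoice {
      // backpressureEstimate is None when backpressure is disabled
      // (or, by assumption here, when no estimate is available yet).
      def effectiveRate(maxRate: Long, backpressureEstimate: Option[Long]): Long =
        backpressureEstimate match {
          case Some(estimate) => math.min(estimate, maxRate) // backpressure branch
          case None           => maxRate                     // plain maxRate branch
        }
    }
    ```

    With a user-set maxRate of 100, an estimate of 40 yields 40, an estimate of 250 is capped at 100, and no estimate yields 100.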

    Now, after the rate is selected, it is always multiplied by the batch duration in seconds, which effectively makes it a rate per second:

    if (effectiveRateLimitPerPartition.values.sum > 0) {
      val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
      Some(effectiveRateLimitPerPartition.map {
        case (tp, limit) => tp -> (secsPerBatch * limit).toLong
      })
    } else {
      None
    }
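
    Given that per-second reading, a configuration for the question's setup might look like the sketch below. The app name is a placeholder and the backpressure setting is an assumption. The two property keys are standard Spark settings, and SparkConf.set takes string values, so the limit is passed as "100" rather than 100.

    ```scala
    import org.apache.spark.SparkConf

    object RateConf {
      // Sketch only: 100 messages per second per partition, backpressure off.
      // With 5 partitions and a 10 s batch interval this allows up to
      // 100 * 5 * 10 = 5000 messages per batch.
      val conf: SparkConf = new SparkConf()
        .setAppName("rate-limit-sketch") // hypothetical app name
        .set("spark.streaming.kafka.maxRatePerPartition", "100")
        .set("spark.streaming.backpressure.enabled", "false")
    }
    ```

    The conf would then be passed to the StreamingContext together with the 10 s batch duration.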
    