How to distribute data evenly in Kafka producing messages through Spark?

Submitted by 大憨熊 on 2021-02-05 08:10:41

Question


I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) takes more data than the others.

+-----------+-----------+-----------------+-------------+
| partition | messages  | earliest offset | next offset |
+-----------+-----------+-----------------+-------------+
| 1         | 166522754 | 5861603324      | 6028126078  |
| 2         | 152251127 | 6010226633      | 6162477760  |
| 3         | 382935293 | 6332944925      | 6715880218  |
| 4         | 188126274 | 6171311709      | 6359437983  |
| 5         | 188270700 | 6100140089      | 6288410789  |
+-----------+-----------+-----------------+-------------+

I found one option: repartition the output Dataset using the number of Kafka partitions (5).

Is there any other way to distribute data evenly?


Answer 1:


How data is partitioned in Kafka does not depend on how the data is partitioned in Spark and its Dataset. From Kafka's perspective it depends on the keys of the messages, or on a custom Partitioner class you apply when writing to Kafka.

Data is partitioned in Kafka according to the following scenarios:

Message key null and no custom partitioner

If no key is defined in the Kafka messages, Kafka will distribute the messages in a round-robin fashion across all partitions. (Note that since Kafka 2.4 the default partitioner uses a "sticky" strategy for keyless messages: it fills a batch for one partition before moving to the next, which still evens out over time.)
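The round-robin behavior can be pictured as a simple shared counter taken modulo the partition count. The class below is an illustrative, self-contained sketch of that idea, not Kafka's actual producer code (the real implementation lives inside the client and handles batching, available partitions, etc.):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of round-robin assignment for keyless messages.
public class RoundRobinSketch {
    private final AtomicInteger counter = new AtomicInteger(0);

    // Each keyless record simply takes the next partition in turn.
    public int nextPartition(int numPartitions) {
        // getAndIncrement may wrap negative after Integer.MAX_VALUE,
        // so mask off the sign bit to stay non-negative.
        int next = counter.getAndIncrement() & 0x7fffffff;
        return next % numPartitions;
    }
}
```

With 5 partitions, successive keyless records land on partitions 0, 1, 2, 3, 4, 0, 1, ... so the load evens out regardless of message content.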

Message key not null and no custom partitioner

If you provide a message key, Kafka will by default decide on the partition based on:

hash(key) % number_of_partitions
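Concretely, the Kafka producer's default partitioner hashes the serialized key with murmur2 and takes the result modulo the partition count. The self-contained sketch below mirrors that scheme (the murmur2 code follows the algorithm in org.apache.kafka.common.utils.Utils; the class and method names here are my own):

```java
import java.nio.charset.StandardCharsets;

// Sketch of Kafka's default key-based partition choice:
// partition = toPositive(murmur2(keyBytes)) % numPartitions
public class DefaultPartitionSketch {

    public static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return toPositive(murmur2(keyBytes)) % numPartitions;
    }

    // Clears the sign bit; well-defined even for Integer.MIN_VALUE.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    // 32-bit murmur2 hash, following the Kafka client's implementation.
    static int murmur2(byte[] data) {
        int length = data.length;
        int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;

        int h = seed ^ length;
        int length4 = length / 4;

        // Mix the input four bytes at a time.
        for (int i = 0; i < length4; i++) {
            final int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                  + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }

        // Handle the last few bytes of the input (intentional fall-through).
        switch (length % 4) {
            case 3: h ^= (data[(length & ~3) + 2] & 0xff) << 16;
            case 2: h ^= (data[(length & ~3) + 1] & 0xff) << 8;
            case 1: h ^= data[length & ~3] & 0xff;
                    h *= m;
        }

        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }
}
```

The practical consequence: every record with the same key always lands on the same partition, so a skewed key distribution (a few very frequent keys) directly produces the kind of imbalance shown in the table above.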

Custom partitioner provided

If you want full control over how Kafka stores messages in the partitions of a topic, you can write your own Partitioner class and set it as the partitioner.class in your producer configuration.

Here is an example of what a custom partitioner class could look like:

import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.InvalidRecordException;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class MyPartitioner implements Partitioner {
  @Override
  public void configure(Map<String, ?> configs) {}

  @Override
  public void close() {}

  @Override
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();

    if ((keyBytes == null) || (!(key instanceof String)))
      throw new InvalidRecordException("Record did not have a string key");

    if (((String) key).equals("myKey"))
      return 0; // this key always goes to partition 0

    // All other records are spread over the remaining partitions by hash
    return (Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1)) + 1;
  }
}
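To make the producer actually use such a class, register it under partitioner.class in the producer properties. A minimal configuration sketch follows; the broker address and the com.example package are placeholder assumptions:

```java
import java.util.Properties;

// Producer configuration sketch: partitioner.class points Kafka at the
// custom partitioner. Broker address and package name are placeholders.
public class ProducerConfigSketch {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("partitioner.class", "com.example.MyPartitioner"); // assumed package
        return props;
    }
}
```

These properties would then be passed to the KafkaProducer constructor (or, in Spark, set via the kafka.-prefixed writer options).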


Source: https://stackoverflow.com/questions/61946420/how-to-distribute-data-evenly-in-kafka-producing-messages-through-spark
