How to distribute data evenly in Kafka producing messages through Spark?

Submitted by 大憨熊 on 2021-02-05 08:10:41

Question


I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) takes more data than the others.

+-----------+-----------+-----------------+-------------+
| partition | messages  | earliest offset | next offset |
+-----------+-----------+-----------------+-------------+
| 1         | 166522754 | 5861603324      | 6028126078  |
| 2         | 152251127 | 6010226633      | 6162477760  |
| 3         | 382935293 | 6332944925      | 6715880218  |
| 4         | 188126274 | 6171311709      | 6359437983  |
| 5         | 188270700 | 6100140089      | 6288410789  |
+-----------+-----------+-----------------+-------------+

I found one option: repartition the output Dataset using the number of Kafka partitions (5).

Is there any other way to distribute data evenly?


Answer 1:


How data is partitioned in Kafka does not depend on how the data is partitioned in Spark and its Dataset. From Kafka's perspective it depends on the keys of the messages, or on a custom Partitioner class you apply when writing to Kafka.

Data is partitioned in Kafka according to the following scenarios:

Message key null and no custom partitioner

If no key is defined in the Kafka messages, Kafka will distribute the messages in a round-robin fashion across all partitions. (Note that since Kafka 2.4 the default partitioner uses a "sticky" strategy for keyless messages: it fills a batch for one partition before moving to the next, which still evens out over time.)
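The round-robin behavior can be pictured as a simple shared counter taken modulo the partition count. The class below is an illustrative, self-contained sketch of that idea, not Kafka's actual producer code (the real implementation lives inside the client and handles batching, available partitions, etc.):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of round-robin assignment for keyless messages.
public class RoundRobinSketch {
    private final AtomicInteger counter = new AtomicInteger(0);

    // Each keyless record simply takes the next partition in turn.
    public int nextPartition(int numPartitions) {
        // getAndIncrement may wrap negative after Integer.MAX_VALUE,
        // so mask off the sign bit to stay non-negative.
        int next = counter.getAndIncrement() & 0x7fffffff;
        return next % numPartitions;
    }
}
```

With 5 partitions, successive keyless records land on partitions 0, 1, 2, 3, 4, 0, 1, ... so the load evens out regardless of message content.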

Message key not null and no custom partitioner

If you provide a message key, Kafka will by default decide on the partition based on:

hash(key) % number_of_partitions
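Concretely, the Kafka producer's default partitioner hashes the serialized key with murmur2 and takes the result modulo the partition count. The self-contained sketch below mirrors that scheme (the murmur2 code follows the algorithm in org.apache.kafka.common.utils.Utils; the class and method names here are my own):

```java
import java.nio.charset.StandardCharsets;

// Sketch of Kafka's default key-based partition choice:
// partition = toPositive(murmur2(keyBytes)) % numPartitions
public class DefaultPartitionSketch {

    public static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return toPositive(murmur2(keyBytes)) % numPartitions;
    }

    // Clears the sign bit; well-defined even for Integer.MIN_VALUE.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    // 32-bit murmur2 hash, following the Kafka client's implementation.
    static int murmur2(byte[] data) {
        int length = data.length;
        int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;

        int h = seed ^ length;
        int length4 = length / 4;

        // Mix the input four bytes at a time.
        for (int i = 0; i < length4; i++) {
            final int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                  + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }

        // Handle the last few bytes of the input (intentional fall-through).
        switch (length % 4) {
            case 3: h ^= (data[(length & ~3) + 2] & 0xff) << 16;
            case 2: h ^= (data[(length & ~3) + 1] & 0xff) << 8;
            case 1: h ^= data[length & ~3] & 0xff;
                    h *= m;
        }

        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }
}
```

The practical consequence: every record with the same key always lands on the same partition, so a skewed key distribution (a few very frequent keys) directly produces the kind of imbalance shown in the table above.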

Custom partitioner provided

If you want full control over how Kafka stores messages in the partitions of a topic, you can write your own Partitioner class and set it as the partitioner.class in your producer configuration.

Here is an example of what a custom partitioner class could look like:

import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.InvalidRecordException;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class MyPartitioner implements Partitioner {
  @Override
  public void configure(Map<String, ?> configs) {}

  @Override
  public void close() {}

  @Override
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
    int numPartitions = partitions.size();

    if ((keyBytes == null) || (!(key instanceof String)))
      throw new InvalidRecordException("Record did not have a string key");

    if (((String) key).equals("myKey"))
      return 0; // this key always goes to partition 0

    // All other records are spread over the remaining partitions by hash
    return (Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1)) + 1;
  }
}
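To make the producer actually use such a class, register it under partitioner.class in the producer properties. A minimal configuration sketch follows; the broker address and the com.example package are placeholder assumptions:

```java
import java.util.Properties;

// Producer configuration sketch: partitioner.class points Kafka at the
// custom partitioner. Broker address and package name are placeholders.
public class ProducerConfigSketch {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("partitioner.class", "com.example.MyPartitioner"); // assumed package
        return props;
    }
}
```

These properties would then be passed to the KafkaProducer constructor (or, in Spark, set via the kafka.-prefixed writer options).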


Source: https://stackoverflow.com/questions/61946420/how-to-distribute-data-evenly-in-kafka-producing-messages-through-spark
