Question
I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) receives more data than the others.
+-----------------------------------------------------+
| partition | messages  | earliest offset | next offset|
+-----------------------------------------------------+
|1 | 166522754 | 5861603324 | 6028126078 |
|2 | 152251127 | 6010226633 | 6162477760 |
|3 | 382935293 | 6332944925 | 6715880218 |
|4 | 188126274 | 6171311709 | 6359437983 |
|5 | 188270700 | 6100140089 | 6288410789 |
+-----------------------------------------------------+
One option I found is to repartition the output Dataset using the number of Kafka partitions (5).
Is there any other way to distribute the data evenly?
Answer 1:
How data is partitioned in Kafka does not depend on how the data is partitioned in Spark and its Dataset. From Kafka's perspective, it depends on the keys of the messages, or on a custom Partitioner class you apply when writing to Kafka.
Data is partitioned in Kafka according to the following scenarios:
Message key null and no custom partitioner
If no key is defined in the Kafka messages, Kafka will distribute the messages in a round-robin fashion across all partitions. (Note: since Kafka client 2.4, the default for keyless messages is the sticky partitioner, which fills a batch for one partition before moving to the next, but still spreads data evenly over time.)
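As an illustration, the round-robin behavior can be sketched in plain Java (no Kafka dependency, just a counter modulo the partition count, which is what evens out the load):

```java
// Minimal sketch of round-robin assignment for keyless messages:
// each new record goes to the next partition in turn, so all
// partitions end up with the same share of the data.
public class RoundRobinSketch {
    public static void main(String[] args) {
        int numPartitions = 5;
        int[] counts = new int[numPartitions];
        for (int record = 0; record < 1000; record++) {
            int partition = record % numPartitions; // round-robin choice
            counts[partition]++;
        }
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p]);
        }
        // 1000 records over 5 partitions -> 200 per partition
    }
}
```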
Message key not null and no custom partitioner
If you provide a message key, by default, Kafka will decide on the partition based on
hash(key) % number_of_partitions
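A small sketch of this scheme, again in plain Java: Kafka's default partitioner actually applies a murmur2 hash to the serialized key bytes, but String.hashCode(), masked to a non-negative value (mirroring Kafka's Utils.toPositive), is a stand-in that shows the same idea.

```java
// Sketch of key-based partition assignment: the same key always
// hashes to the same partition, so per-key ordering is preserved,
// but skewed key distributions can overload a single partition.
public class KeyHashSketch {
    static int partitionFor(String key, int numPartitions) {
        // mask to non-negative, then take the modulo
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 5;
        // the same key maps to the same partition on every call
        System.out.println("user-42 -> " + partitionFor("user-42", numPartitions));
        System.out.println("user-42 -> " + partitionFor("user-42", numPartitions));
        // a different key may land on a different partition
        System.out.println("user-43 -> " + partitionFor("user-43", numPartitions));
    }
}
```

This also explains the skew in the question: if many records share one hot key (or the key space hashes unevenly), one partition such as #3 receives disproportionately more data.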
Provide custom partitioner
If you want full control over how Kafka assigns messages to the partitions of a topic, you can write your own Partitioner class and set it as the partitioner.class in your producer configuration.
Here is an example of what a custom partitioner class could look like:
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.InvalidRecordException;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class MyPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        if ((keyBytes == null) || (!(key instanceof String)))
            throw new InvalidRecordException("Record did not have a string key");
        if (((String) key).equals("myKey"))
            return 0; // this key always goes to partition 0
        // all other records are hashed onto the remaining partitions
        return (Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1)) + 1;
    }
}
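To activate the custom partitioner, register it under the standard producer property partitioner.class; a minimal configuration sketch (the broker address and the com.example package are placeholder assumptions):

```java
import java.util.Properties;

// Sketch of registering a custom partitioner in the producer config.
// "partitioner.class" is the standard Kafka producer property name;
// the broker address and MyPartitioner's package are placeholders.
public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // tell the producer to route records through the custom partitioner
        props.put("partitioner.class", "com.example.MyPartitioner");

        System.out.println(props.getProperty("partitioner.class"));
    }
}
```

A KafkaProducer<String, String> built from these properties would then call MyPartitioner.partition(...) for every record it sends.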
Source: https://stackoverflow.com/questions/61946420/how-to-distribute-data-evenly-in-kafka-producing-messages-through-spark