How to efficiently produce messages out of a collection to Kafka

Submitted by 旧城冷巷雨未停 on 2020-06-16 19:07:36

Question


In my Scala (2.11) stream application I am consuming data from one queue in IBM MQ and writing it to a Kafka topic that has one partition. After consuming the data from MQ, the message payload is split into 3000 smaller messages that are stored in a Sequence of Strings. Each of these 3000 messages is then sent to Kafka (version 2.x) using KafkaProducer.

How would you send those 3000 messages?

I can't increase the number of queues in IBM MQ (not under my control) or the number of partitions in the topic (message ordering is required, and writing a custom partitioner would impact too many consumers of the topic).

The Producer settings are currently:

  • acks=1
  • linger.ms=0
  • batch.size=65536

But optimizing them is probably a question of its own and not part of my current problem.
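For reference, those settings correspond to a producer configuration roughly like the sketch below. The bootstrap server address and the use of String serializers are assumptions, since the question never shows someProperties:

```scala
import java.util.Properties

// Sketch of the producer configuration implied by the settings above.
// "localhost:9092" is a placeholder for the real broker list, which
// the question does not give; String serializers are assumed from the
// KafkaProducer[String, String] type.
val someProperties = new Properties()
someProperties.put("bootstrap.servers", "localhost:9092")
someProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
someProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
someProperties.put("acks", "1")
someProperties.put("linger.ms", "0")
someProperties.put("batch.size", "65536")
```
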

Currently, I am doing:

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

private lazy val kafkaProducer: KafkaProducer[String, String] = new KafkaProducer[String, String](someProperties)
val messages: Seq[String] = Seq(String1, …, String3000)
for (msg <- messages) {
    val future = kafkaProducer.send(new ProducerRecord[String, String](someTopic, someKey, msg))
    // Blocking on each future before the next send serializes the whole loop.
    val recordMetadata = future.get()
}

To me this looks like neither the most elegant nor the most efficient way. Is there a programmatic way to increase throughput?


edit after answer from @radai

Thanks to the answer pointing me in the right direction, I had a closer look at the different producer methods. The book Kafka: The Definitive Guide lists these methods:

Fire-and-forget We send a message to the server and don't really care if it arrives successfully or not. Most of the time, it will arrive successfully, since Kafka is highly available and the producer will retry sending messages automatically. However, some messages will get lost using this method.

Synchronous send We send a message, the send() method returns a Future object, and we use get() to wait on the future and see if the send() was successful or not.

Asynchronous send We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker.

And now my code looks like this (leaving out error handling and the definition of Callback class):

  val asyncProducer = new KafkaProducer[String, String](someProperties)

  for (msg <- messages) {
    val record = new ProducerRecord[String, String](someTopic, someKey, msg)
    asyncProducer.send(record, new compareProducerCallback)
  }
  asyncProducer.flush()
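The Callback class the snippet leaves out could be sketched as follows. The body is an assumption (the question omits it); only the class name compareProducerCallback comes from the original:

```scala
import org.apache.kafka.clients.producer.{Callback, RecordMetadata}

// Minimal sketch of the omitted callback class. The broker's response
// arrives here asynchronously: exception is null on success, non-null
// if the send failed after retries. Real error handling is left out,
// as in the question.
class compareProducerCallback extends Callback {
  override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
    if (exception != null) {
      exception.printStackTrace()
    }
  }
}
```
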

I have compared all three methods for 10,000 very small messages. Here are my measured results:

  1. Fire-and-forget: 173,683,464 ns (~0.17 s)

  2. Synchronous send: 29,195,039,875 ns (~29.2 s)

  3. Asynchronous send: 44,153,826 ns (~0.04 s)

To be honest, there is probably more potential to optimize all of them by choosing the right properties (batch.size, linger.ms, ...).


Answer 1:


The biggest reason I can see for your code being slow is that you're waiting on every single send future.

Kafka was designed to send batches. By sending one record at a time you're paying a round trip for every single record, and you're not getting any benefit from compression.

The "idiomatic" thing to do would be to send everything, and then block on all the resulting futures in a second loop.

Also, if you intend to do this I'd bump linger.ms back up (otherwise your first record would result in a batch of size one, slowing you down overall; see https://en.wikipedia.org/wiki/Nagle%27s_algorithm) and call flush() on the producer once your send loop is done.
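The idiom described above could be sketched like this. Wrapping it in a helper method (sendAll) and taking the Producer interface as a parameter are my choices for illustration; in the question the arguments would be kafkaProducer, someTopic, someKey and the Seq of 3000 strings:

```scala
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.{Producer, ProducerRecord, RecordMetadata}

// Sketch of the "queue everything first, wait afterwards" idiom.
def sendAll(producer: Producer[String, String],
            topic: String,
            key: String,
            messages: Seq[String]): Unit = {
  // First loop: hand every record to the producer without blocking,
  // so the client can batch (and compress) records together.
  val futures: Seq[Future[RecordMetadata]] = messages.map { msg =>
    producer.send(new ProducerRecord[String, String](topic, key, msg))
  }
  // Push out whatever is still sitting in the record accumulator.
  producer.flush()
  // Second loop: only now block on the results, so a failure of any
  // individual record still surfaces.
  futures.foreach(_.get())
}
```
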



Source: https://stackoverflow.com/questions/58420634/how-to-efficiently-produce-messages-out-of-a-collection-to-kafka
