Question
I have a Kafka Streams application which takes data from a few topics, joins the data, and puts it in another topic.
Kafka Configuration:
5 Kafka brokers
Kafka topics - 15 partitions, replication factor 3.
Note: I am running Kafka Streams Applications on the same machines where my Kafka Brokers are running.
A few million records are consumed/produced every hour. Whenever I take any Kafka broker down, the application goes into rebalancing; it takes approx. 30 minutes or sometimes even more, and it often kills many of the Kafka Streams processes.
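For reference, a minimal sketch of the kind of join topology described above, using the Kafka Streams DSL; the topic names, serdes, and window size are hypothetical, not taken from the question:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class JoinApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-app");          // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> left = builder.stream("input-a");   // hypothetical topic
        KStream<String, String> right = builder.stream("input-b");  // hypothetical topic

        // Join records from both topics that share a key within a 5-minute window,
        // then write the joined result to the output topic.
        left.join(right,
                  (l, r) -> l + "|" + r,
                  JoinWindows.of(Duration.ofMinutes(5)))
            .to("joined-output");                                   // hypothetical topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```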
Answer 1:
It is technically possible to run your Kafka Streams application on the same servers as your brokers, but it is not recommended. Both would need to share the same resources and you would end up with resource contention.
Whenever I take any kafka broker down, it goes into rebalancing
Not sure why this is happening. What version of Kafka and the Streams API are you using? If you are on broker 0.10.1+, I would highly recommend upgrading your Streams application to 0.11 (note, you can do this without a broker upgrade).
Depending on the details of the issue you are facing, StandbyTasks might help with long rebalance times. You can simply configure the parameter num.standby.replicas = 1 to enable StandbyTasks.
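As a minimal sketch, this is how the setting would be added to the application's properties (the key is exposed in StreamsConfig as NUM_STANDBY_REPLICAS_CONFIG; the surrounding configuration is assumed):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// ... application.id, bootstrap.servers, serdes, etc. ...

// Keep one warm replica of each task's state store on another instance,
// so a rebalanced task can resume without restoring state from scratch.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
```

Standby replicas cost extra disk and network on the standby instances, but they shorten failover because the state store does not have to be rebuilt from the changelog topic.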
Answer 2:
Answering the question in the title:
Coming from a Spark/HDFS background, I think this requires a change of thinking, since you are used to thinking that it is good to have your processing where your data is, to take advantage of data locality. Here, the broker provides the data locality but has to send the data to the Kafka Streams cluster for processing (losing some of that benefit). However, keeping them separate allows you to manage both clusters independently.
If you think of a cluster that runs high-latency processing jobs and shares data and processing (e.g. an HDFS + YARN cluster), you get "processing where the data is" rather than the opposite. You can allocate resources for your data processing, but the idea is that your processing depends not on temporary data spikes (as it does with streaming) but on total data volume. If your data grows, your computations take longer and you allocate more resources, but both grow together. In a streaming application, however, the necessary processing power depends on data spikes (and your low-latency requirements), not on total data volume, so it makes sense to dimension and manage storage and processing separately, since their elasticity demands are not driven by the same dimension.
This is apart from the obvious fact that running both data handling (the Kafka broker) and data processing (Kafka Streams) on the same node puts more load on that node, but we are assuming here that this has been taken into account when dimensioning your nodes.
Source: https://stackoverflow.com/questions/46176362/can-i-run-kafka-streams-application-on-the-same-machine-as-of-kafka-broker