Question
I have a topic consisting of n partitions. To get distributed processing, I create two processes running on different machines. They subscribe to the topic with the same group ID, and each allocates n/2 threads, each thread processing a single stream (so n/2 partitions per process).
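For concreteness, here is a minimal sketch of that setup (not the original poster's code): one process starting n/2 consumer threads, all joining the same consumer group. It is written against the current KafkaConsumer API rather than the older high-level consumer "streams" API the question's wording suggests; the broker address, topic name, group ID, and thread count are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PartitionedConsumerProcess {

    // n/2 threads per process, e.g. 4 threads for an 8-partition topic
    static final int THREADS_PER_PROCESS = 4;

    public static void main(String[] args) {
        for (int i = 0; i < THREADS_PER_PROCESS; i++) {
            new Thread(PartitionedConsumerProcess::runConsumer, "consumer-" + i).start();
        }
    }

    static void runConsumer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");   // placeholder broker
        props.put("group.id", "my-processing-group");    // same group ID on both machines
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process one message from whichever partition this thread currently owns
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Running the same program on both machines gives the two-process, n/2-threads-each layout described above.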
With this I achieve load distribution, but if process 1 crashes, then process 2 cannot consume messages from the partitions allocated to process 1, because it listened on only n/2 streams from the start.
Alternatively, if I configure for HA and start n threads/streams on both processes, then when one node fails, all partitions will be processed by the other node. But now distribution is compromised, because all partitions are processed by a single node at a time.
Is there a way to achieve both simultaneously and how?
Answer 1:
Yes, use an existing stream processing engine. Storm is a good choice, as are Spark and Samza, depending on your use case.
Now you could roll your own, but as you've already discovered, managing failing processes and high availability is tricky. Generally speaking, distributed processing is filled with lots of subtle problems that someone else has already solved. In your shoes I'd use existing software to deal with that problem.
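To illustrate the suggestion (this is not the answerer's code), here is a rough sketch of a Storm topology that reads the topic through the storm-kafka-client spout, assuming Storm 1.2+/2.x. The broker address, topic, group ID, parallelism numbers, and the processing bolt are all placeholders; the point is that partition assignment across spout executors and reassignment after a worker failure are handled by the framework.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class KafkaProcessingTopology {

    // Placeholder bolt: real processing logic would go here.
    public static class ProcessBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // the storm-kafka-client spout emits topic, partition, offset, key, value fields
            System.out.println("processing: " + tuple.getStringByField("value"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, nothing emitted downstream
        }
    }

    public static void main(String[] args) throws Exception {
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("broker:9092", "my-topic")   // placeholders
                        .setProp("group.id", "my-processing-group")
                        .build();

        TopologyBuilder builder = new TopologyBuilder();
        // Two spout executors: Kafka partitions are split between them,
        // and Storm reassigns work if a worker dies.
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
        builder.setBolt("process", new ProcessBolt(), 4).shuffleGrouping("kafka-spout");

        Config conf = new Config();
        conf.setNumWorkers(2); // run the topology across two worker processes

        StormSubmitter.submitTopology("kafka-processing", conf, builder.createTopology());
    }
}
```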
Source: https://stackoverflow.com/questions/30060261/how-to-achieve-distributed-processing-and-high-availability-simultaneously-in-ka