Question
Background with example:
We are building a stream processing application that receives a stream of data, runs algorithms on it and stores the results in a database. As an example for this question, let's use a stream of purchases. Each purchase carries the geo-location of the purchase (store location or IP-based location).
The stream of purchases is coming from a Kafka topic.
We now want to process that stream of data and output some statistics. For example, we want to see the average purchase price for every 100x100-meter square in the world.
We want to be able to scale dynamically so that we can handle spikes without using resources we don't need.
Of all the components in our solution, let's look at one small part: the part that updates the database. Say we store the final statistics in some kind of database; for every 100x100-meter square it contains the average price, but also the count and the sum (so we can recalculate the average on new data).
Given a newly purchased item, we find the relevant 100x100 square, load its data from the database, update the values in memory, and then write the updated values back to the database. (We actually need to run a somewhat more complicated algorithm, but for the sake of the example let's keep it simple.)
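To make that concrete, here is a minimal sketch of the read-modify-write cycle (assuming MongoDB via pymongo; the `squares` collection name and the naive lat/lon grid function are just for illustration, not our actual system):

```python
# Minimal sketch of the per-purchase read-modify-write cycle.
# The collection name and the grid function are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
squares = client["stats"]["squares"]

def square_id(lat: float, lon: float) -> str:
    # Roughly 100 m per 0.001 degree of latitude; a real system would use
    # a proper geo grid, this is just for the example.
    return f"{int(lat * 1000)}:{int(lon * 1000)}"

def process_purchase(lat: float, lon: float, price: float) -> None:
    sid = square_id(lat, lon)
    doc = squares.find_one({"_id": sid}) or {"_id": sid, "count": 0, "sum": 0.0}

    # Update the values in memory ...
    doc["count"] += 1
    doc["sum"] += price
    doc["avg"] = doc["sum"] / doc["count"]

    # ... then write them back. This load-modify-store step is what races
    # when two instances handle purchases for the same square.
    squares.replace_one({"_id": sid}, doc, upsert=True)
```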
This sounds easy, but when we start thinking about scaling it becomes more challenging.
The challenge:
In order to support high scale, we want to have multiple instances of the processing service, which handle the purchases and update the relevant 100x100 squares in the database.
The problem occurs when 2 instances try to process 2 different purchases falling into the same 100x100 square. In this case, there might be a race condition that will result in only one of the purchases affecting the final result.
Let's say the events occur in the following order:
- 1st instance takes the 1st purchase and loads the data from the database (data-A)
- 2nd instance takes the 2nd purchase and loads the data from the database (data-A)
- 1st instance updates data-A to data-B and writes it to the database
- 2nd instance updates data-A to data-C and writes it to the database
As a result, data-C is calculated from data-A and purchase 2, and is missing the information from purchase 1.
Our thoughts
We thought to solve it using Kafka topic partitions. Kafka assigns the partitions of a topic to consumers, and every message goes into a single partition of the topic. We can also control which partition a message goes to, say by the ID of the 100x100 square. That way all the purchases from a given square land in the same partition and therefore arrive at the same consumer, so there will never be 2 instances of our service wanting to update the same square.
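On the producer side this would mean keying each message by the square ID (the sketch below assumes the kafka-python client and a topic named `purchases`, both illustrative choices); Kafka's default partitioner hashes a non-null key, so all purchases for one square end up in the same partition:

```python
# Sketch of keying purchases by square ID so they land in one partition.
# The kafka-python client and the "purchases" topic name are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_purchase(square_id: str, purchase: dict) -> None:
    # With a non-null key, the default partitioner hashes the key, so every
    # purchase for the same square goes to the same partition and therefore
    # to the same consumer in the group.
    producer.send("purchases", key=square_id, value=purchase)
```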
This works fine until we want to apply dynamic scaling. When Kafka consumers are added or removed, Kafka re-assigns the partitions to the consumers, which in some cases means a partition moves from one consumer to another. This can cause 2 instances to process purchases from the same squares for a limited amount of time.
To overcome this, we thought about using distributed locks, which can be implemented over MongoDB, Redis and probably other databases. Combined with the Kafka topic partitioning, most of the time the locks will not block any processing; only during dynamic scaling will some workers be blocked for a short period of time.
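We have in mind something along these lines, assuming Redis and its SET ... NX PX command (the key naming, TTL and retry interval are arbitrary choices for illustration; a real deployment would likely use a vetted lock recipe):

```python
# Rough sketch of a per-square lock over Redis using SET ... NX PX.
# Key naming, TTL and retry interval are illustrative assumptions.
import time
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

def with_square_lock(square_id: str, work, ttl_ms: int = 5000):
    token = str(uuid.uuid4())
    lock_key = f"lock:square:{square_id}"

    # NX = only set if the key is absent, PX = expire after ttl_ms so a
    # crashed worker cannot hold the lock forever.
    while not r.set(lock_key, token, nx=True, px=ttl_ms):
        time.sleep(0.05)
    try:
        return work()
    finally:
        # Release only if we still own the lock. A real implementation would
        # do this check-and-delete atomically (e.g. with a Lua script).
        if r.get(lock_key) == token.encode("utf-8"):
            r.delete(lock_key)
```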
We have 2 main issues with the approach we found: it is complicated, and it adds locking latency even though most of the time we don't need it.
Questions:
- Is there a better approach to solve our problem?
- Will our approach actually work? Are we missing something in how Kafka works?
Edit 30.10.2019 12:00:
Avoiding locks:
To avoid locks we thought about using an update condition with MongoDB. By doing so we are trying to implement an optimistic locking strategy.
We will have the square ID property uniquely indexed, and we will add a version field to each document. There are 2 options:
1. The document doesn't exist yet. When we try to read the existing data we get no result, so we know the document doesn't exist. We create the first value and try to insert it. If we succeed, it means we were the first to do so; if we fail (unique constraint violation on the square ID), it means someone else inserted the value first, and we move on to option 2 below.
2. The document exists. When we read the existing data, we keep the old version. When we want to update the document in the DB, we send the new value and bump the version number by 1, and we add an update condition: the document ID must match and the version must equal the old version. When we inspect the update result we can see how many documents were updated. Zero updates means the version was changed, and we retry the whole process; 1 update means we succeeded.
(This blog post explains it in much more detail)
This will only work if we perform the read and the write operations with read/write concern set to majority (otherwise we will lose data).
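Putting it together, this is a sketch of what we have in mind (pymongo, the square ID used as the _id so the unique index is implicit, majority read/write concern on the collection; all names are just for illustration):

```python
# Sketch of the optimistic-locking update described above, using pymongo.
# The square ID serves as _id (so the unique index is implicit) and the
# read/write concerns are set to majority, as discussed.
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
squares = client["stats"].get_collection(
    "squares",
    read_concern=ReadConcern("majority"),
    write_concern=WriteConcern(w="majority"),
)

def apply_purchase(square_id: str, price: float) -> None:
    while True:
        doc = squares.find_one({"_id": square_id})

        if doc is None:
            # Option 1: the document doesn't exist yet, try to insert it.
            try:
                squares.insert_one({"_id": square_id, "version": 1,
                                    "count": 1, "sum": price, "avg": price})
                return
            except DuplicateKeyError:
                continue  # someone else inserted it first; retry as option 2

        # Option 2: conditional update keyed on the old version number.
        new_count = doc["count"] + 1
        new_sum = doc["sum"] + price
        result = squares.update_one(
            {"_id": square_id, "version": doc["version"]},
            {"$set": {"count": new_count, "sum": new_sum,
                      "avg": new_sum / new_count},
             "$inc": {"version": 1}},
        )
        if result.modified_count == 1:
            return  # our update won
        # modified_count == 0: the version changed underneath us; retry.
```

The retry loop is what replaces the lock here: on contention one writer simply re-reads the document and tries again.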
The question we have now is about performance: which will be faster, the distributed locking or the majority read/write concern updates?
Source: https://stackoverflow.com/questions/58609347/synchronize-writes-to-db-from-dynamically-scaled-microservices