A distributed algorithm to assign a shared resource to one node out of many

Question

I am trying to implement multicast communication to distribute some resources. I'm using jGroups for that, so I have reliable multicast and FIFO ordering. I want to do this with a distributed solution, that is, without a master node that acts as a coordinator.

Every node can start a distribution, so two or more nodes may start one at the same time. When a node receives a distribution message, it answers it. There is no difference between an answer message and a message from a starter: each contains only the resource name (e.g. resourceA) and whether the sending node is able to handle it. When Member1 starts the distribution, it sends a message like:

Member1, resourceA, OK

Member2 has no space for that resource and sends an answer message like:

Member2, resourceA, NOT_OK

This case is easy, because Member1 now knows that it can take resourceA. When more than one node is able to handle the resource, other properties decide who takes it (e.g. the member with the highest ID).
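The tie-break described above could be sketched roughly as follows. This is only an illustration: the `Offer` class, its field names, and the `Decision.winner` helper are hypothetical and not part of jGroups, and member IDs are assumed to be plain integers.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical message type mirroring "Member1, resourceA, OK".
final class Offer {
    final int memberId;
    final String resource;
    final boolean ok;

    Offer(int memberId, String resource, boolean ok) {
        this.memberId = memberId;
        this.resource = resource;
        this.ok = ok;
    }
}

final class Decision {
    // Among the members that answered OK for this resource, the highest ID wins.
    // Returns an empty Optional when no member is able to take the resource.
    static Optional<Integer> winner(List<Offer> offers, String resource) {
        return offers.stream()
                .filter(o -> o.resource.equals(resource) && o.ok)
                .map(o -> o.memberId)
                .max(Integer::compare);
    }
}
```

Because every node applies the same deterministic rule to the same set of messages, each node should reach the same result without asking a coordinator.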

My problem is: how do I handle the case where two or more nodes start the distribution for the same topic (resourceA) at the same time?

Does anybody see a problem with doing it this way? Member1 and Member2 start the distribution at the same time. At this point, each expects a response from the other. Because a response message is indistinguishable from a starter message, each treats the message it receives as an answer. So Member1 sends a starter message to the multicast group, and Member2 sends a starter message to the multicast group (before receiving the message from Member1). Member1 then receives the starter message from Member2 and treats it as the response.

If I guarantee that every node sends only one message per topic (as a starter or as a response), I would say there are no problems with doing it this way, even with more than two nodes.
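A minimal sketch of the "one message per topic" rule, under stated assumptions: the `Node` class and all its members are hypothetical, the `sent` list merely stands in for the real multicast send, and the set of members is assumed to be fixed and known to every node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Node {
    private final int selfId;
    private final Set<Integer> members;    // assumption: fixed membership known to every node
    private final Set<String> acceptable;  // resources this node has room for
    private final Map<String, Map<Integer, Boolean>> votes = new HashMap<>();
    final List<String> sent = new ArrayList<>(); // stand-in for the multicast send

    Node(int selfId, Set<Integer> members, Set<String> acceptable) {
        this.selfId = selfId;
        this.members = members;
        this.acceptable = acceptable;
    }

    // Called locally when this node wants to start a distribution.
    void start(String resource) {
        announceOnce(resource);
    }

    // Called for every message received about a resource.
    void onMessage(int senderId, String resource, boolean ok) {
        announceOnce(resource); // a peer's message also triggers our own single message
        votes.get(resource).put(senderId, ok);
    }

    // Sends this node's message for the resource unless it has already been sent.
    private void announceOnce(String resource) {
        if (votes.containsKey(resource)) {
            return;
        }
        boolean ok = acceptable.contains(resource);
        Map<Integer, Boolean> perMember = new HashMap<>();
        perMember.put(selfId, ok);
        votes.put(resource, perMember);
        sent.add("Member" + selfId + ", " + resource + ", " + (ok ? "OK" : "NOT_OK"));
    }

    // True once a message from every known member has been recorded.
    boolean heardFromAll(String resource) {
        Map<Integer, Boolean> perMember = votes.get(resource);
        return perMember != null && perMember.keySet().containsAll(members);
    }
}
```

In this sketch, two nodes that call `start("resourceA")` at the same time each send exactly one message, and each treats the other's starter message as the answer it was waiting for.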

Answer 1:

From your description, the following conclusions can be drawn:

All members are assumed to be running from the start; no members are added or removed once the system is running.
All members are aware of the total number of members in the system.

If one of these conclusions is incorrect (or both), then I do not see how your algorithm works, because there is no way to know when all members have responded to a starter message and conclude which member has the highest ID.

If both conclusions are correct, then I do not see a problem with the functionality of the algorithm and your approach seems to work. However, the resulting system will be error prone with regard to a failing or non-responding member. If one member does not respond to a starter message, then you could end up with the situation where it is impossible to decide who will take the resource, because it might or might not be that non-responding member.

Unfortunately, it is very likely that one of the members will fail to respond at some point, although you did not give any information about uptime requirements. To avoid a total breakdown of the algorithm just because one member is not responding, you will have to design precautions into the algorithm, for example by adding a time-out and removing a member from the "known members list" if it does not respond in time.
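One way such a time-out might look is sketched below. The `Round` class and its methods are hypothetical names chosen for illustration, and the current time is passed in explicitly (an assumption made to keep the sketch easy to test) rather than read from a clock.

```java
import java.util.HashSet;
import java.util.Set;

final class Round {
    private final Set<Integer> expected;                  // the "known members list"
    private final Set<Integer> responded = new HashSet<>();
    private final long deadlineMillis;

    Round(Set<Integer> knownMembers, long startMillis, long timeoutMillis) {
        this.expected = new HashSet<>(knownMembers);
        this.deadlineMillis = startMillis + timeoutMillis;
    }

    // Records that a member has sent its message for this round.
    void onResponse(int memberId) {
        responded.add(memberId);
    }

    // Once the deadline has passed, removes silent members from the known
    // members list and returns the IDs that were removed.
    Set<Integer> expire(long nowMillis) {
        Set<Integer> dropped = new HashSet<>();
        if (nowMillis >= deadlineMillis) {
            dropped.addAll(expected);
            dropped.removeAll(responded);
            expected.removeAll(dropped);
        }
        return dropped;
    }

    // True when every member still on the list has responded.
    boolean complete() {
        return responded.containsAll(expected);
    }
}
```

Each node would run this independently, so two nodes can time out at slightly different moments and end up with different member lists for a while.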

But even with such built-in fault tolerance, you should realize that a completely distributed solution without some kind of coordinating master will, by definition, have situations that are difficult to deal with. For example, in a distributed environment, a network problem could lead to a situation where one half of the network does not see the other half. Since there is no coordinating master to draw any final conclusions, both halves of the network think "they know the world" and will continue to do their thing. In order to make decisions about how to resolve that, you will have to be more clear about your requirements and give a better picture of possible fault situations...

Source: https://stackoverflow.com/questions/12450987/a-distributed-algorithm-to-assign-a-shared-resource-to-one-node-out-of-many

Tags

java

networking

network-programming

distributed-computing

multicast