Given a list of objects with multiple attributes, I need to find the list of sets created by a union of all intersecting subsets. Specifically, these are Person objects.
So your collection example could look like this:
A { ss |-> 42, dl |-> 123 }
B { ss |-> 42, dl |-> 456 }
C { ss |-> 23, dl |-> 456 }
D { ss |-> 89, dl |-> 789 }
E { ss |-> 89, dl |-> 432 }
Then I would suggest an algorithm that builds up multi-collections by incrementally merging or inserting each collection into the multi-collections:
Iteration 1. The first collection becomes the only multi-collection:
{A} { ss |-> [42], dl |-> [123] }
Iteration 2. Merge the next collection into the first since SSN is already present:
{A,B} { ss |-> [42], dl |-> [123,456] }
Iteration 3. Merge again, since the DLN is already there:
{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
Iteration 4. Insert a new multi-collection since there is no match:
{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
{D} { ss |-> [89], dl |-> [789] }
Iteration 5. Merge with second multi collection, since the SSN is there:
{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
{D,E} { ss |-> [89], dl |-> [432,789] }
So in each iteration (one for each collection), you must identify all multi-collections that have values in common with the collection you are processing, and merge all these together.
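A minimal sketch of this incremental merging, assuming each collection is a map of attribute name to value (the class and method names, and the use of integer indices for A..E, are my own):

```java
import java.util.*;

public class MergeGroups {
    // Each "collection" is a map of attribute name -> value, e.g. {ss=42, dl=123}.
    // Returns groups of indices whose collections are transitively linked by a shared value.
    static List<Set<Integer>> group(List<Map<String, Integer>> collections) {
        List<Set<Integer>> groups = new ArrayList<>();              // members of each multi-collection
        List<Map<String, Set<Integer>>> values = new ArrayList<>(); // attribute -> value set per group
        for (int i = 0; i < collections.size(); i++) {
            Map<String, Integer> c = collections.get(i);
            // Identify every existing multi-collection sharing some value with c.
            List<Integer> matches = new ArrayList<>();
            for (int g = 0; g < groups.size(); g++) {
                for (Map.Entry<String, Integer> e : c.entrySet()) {
                    Set<Integer> seen = values.get(g).get(e.getKey());
                    if (seen != null && seen.contains(e.getValue())) {
                        matches.add(g);
                        break;
                    }
                }
            }
            // Start a fresh multi-collection from c alone.
            Set<Integer> members = new TreeSet<>();
            members.add(i);
            Map<String, Set<Integer>> merged = new HashMap<>();
            for (Map.Entry<String, Integer> e : c.entrySet())
                merged.computeIfAbsent(e.getKey(), k -> new HashSet<>()).add(e.getValue());
            // Merge in all matching multi-collections (reverse order so removal is safe).
            for (int m = matches.size() - 1; m >= 0; m--) {
                int g = matches.get(m);
                members.addAll(groups.remove(g));
                values.remove(g).forEach((k, v) ->
                        merged.computeIfAbsent(k, x -> new HashSet<>()).addAll(v));
            }
            groups.add(members);
            values.add(merged);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> people = List.of(
                Map.of("ss", 42, "dl", 123),  // A
                Map.of("ss", 42, "dl", 456),  // B
                Map.of("ss", 23, "dl", 456),  // C
                Map.of("ss", 89, "dl", 789),  // D
                Map.of("ss", 89, "dl", 432)); // E
        System.out.println(group(people)); // prints [[0, 1, 2], [3, 4]]
    }
}
```

On the example data this reproduces the walkthrough above: A, B, C end up in one multi-collection and D, E in another.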
In general, if there are n collections, each with a constant number k of attributes, then this algorithm runs in time O(n*n*k) = O(n^2). The worst-case behaviour is exhibited when all attribute values are distinct. When there is more sharing between attribute values, the time it takes to insert into and test membership of the attribute value sets (like [23,42]) becomes the dominant factor, so the attribute value sets should be efficient.
If you use optimal disjoint sets, then each Find or Merge operation will run in amortized time O(α(n)).
Thus, in each iteration there will be at most n multi-collections (the situation when no multi-collections have been merged so far). To integrate the new collection into the multi-collections, you have to perform a Find operation on each multi-collection's k sets to identify all multi-collections to be merged, which takes time bounded by O(nkα(n)). Merging the at most k multi-collections found this way takes O(k^2 α(n)).
So over all iterations the time is bounded by O(n(nkα(n) + k^2 α(n))) = O(n^2 kα(n)) = O(n^2 α(n)), since k is a constant.
Because α(n) for all practical purposes is also a constant, the total time is bounded by O(n2).
First, is there some inherent hierarchy in identifiers, and do contradicting identifiers of a higher sort cancel out the same identifier of a lower sort? For example, if A and B have the same SSN, B and C have the same DLN, and C and D have the same SSN which does not match A and B's SSN, does that mean that there are two groups or one?
Assuming contradictions don't matter, you are dealing with equivalence classes, as user 57368 (unknown (Google)) states. For equivalence classes, people often turn to the union-find structure. As for how to perform these unions, it's not immediately trivial, because I assume you don't have a direct link A-B when both A and B have the same SSN. Instead, our sets will consist of two kinds of elements: each (attribute type, attribute value) pair is an element, and you also have one element per object. When you iterate through the list of attributes for an object, perform the union of the object's element with each of its attribute elements.
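This can be sketched with a small union-find over string keys, where objects and (attribute type, attribute value) pairs share one universe of elements (the class name and the "type:value" key encoding are my own; this version uses path compression only, without union by rank):

```java
import java.util.*;

public class UnionFind {
    // Maps each element to its parent; an element is its own parent iff it is a root.
    private final Map<String, String> parent = new HashMap<>();

    String find(String x) {
        parent.putIfAbsent(x, x);          // lazily register unseen elements
        String root = parent.get(x);
        if (!root.equals(x)) {
            root = find(root);
            parent.put(x, root);           // path compression
        }
        return root;
    }

    void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public static void main(String[] args) {
        UnionFind uf = new UnionFind();
        // One element per object, one per (attribute type, value) pair.
        uf.union("A", "ss:42");  uf.union("A", "dl:123");
        uf.union("B", "ss:42");  uf.union("B", "dl:456");
        uf.union("C", "ss:23");  uf.union("C", "dl:456");
        uf.union("D", "ss:89");  uf.union("D", "dl:789");
        uf.union("E", "ss:89");  uf.union("E", "dl:432");
        System.out.println(uf.find("A").equals(uf.find("C"))); // true: A-B via ss, B-C via dl
        System.out.println(uf.find("A").equals(uf.find("D"))); // false: separate group
    }
}
```

Two objects end up in the same group exactly when they are connected through a chain of shared attribute values, with no explicit A-B link ever needed.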
One of the important features of the union-find data structure is that the resulting structure represents the sets: it lets you query "What set is A in?" If this is not enough, let us know and we can improve the result. But the most important feature is that each Union and Find operation runs in near-constant amortized time.
I would guess that you have a relatively small set of attributes for the Person object (as compared to the number of Person objects you're considering). If you want to reduce traversing the list of Person objects multiple times, you can take a Person, put its attributes into a list of known possible connections and then move on to the next Person. With each successive Person, you see if it is connected to any prior connection. If so, then you add its unique attributes to the possible connections. You should be able to process all Person objects in one pass. It's possible that you'll have some disconnected sets in the results, so it may be worth examining the unconnected Person objects after you've created the first graph.
To expand on my comment in the original post, you want to create a list of sets where each member of a given set shares at least one attribute with at least one other member of that set.
Naively, this can be solved by finding all pairs that share an attribute and then iteratively merging pairs that have a partner in common. This is O(N^3): O(N^2) for iterating over pairs, with up to N separate sets to check for membership.
You can also think of this problem as determining the connected component of a graph, where every object and every unique attribute value is a node; each object would be connected to each of its attribute values. Setting up that graph would take linear time, and you could determine the connected components in linear time with a breadth or depth first search.
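A sketch of that graph view, using a breadth-first search over a bipartite adjacency map of person nodes and attribute-value nodes (the class name, the "type:value" key encoding, and the edge list in main are my own):

```java
import java.util.*;

public class Components {
    // Bipartite graph: person nodes ("A"...) and attribute-value nodes ("ss:42"...).
    // BFS from each unvisited node yields one connected component.
    static List<Set<String>> components(Map<String, List<String>> adjacency) {
        List<Set<String>> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String start : adjacency.keySet()) {
            if (!visited.add(start)) continue;           // already reached earlier
            Set<String> component = new TreeSet<>();
            Deque<String> queue = new ArrayDeque<>(List.of(start));
            while (!queue.isEmpty()) {
                String node = queue.poll();
                component.add(node);
                for (String next : adjacency.getOrDefault(node, List.of()))
                    if (visited.add(next)) queue.add(next);
            }
            result.add(component);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new LinkedHashMap<>();
        String[][] edges = { {"A","ss:42"}, {"A","dl:123"}, {"B","ss:42"}, {"B","dl:456"},
                             {"C","ss:23"}, {"C","dl:456"}, {"D","ss:89"}, {"D","dl:789"},
                             {"E","ss:89"}, {"E","dl:432"} };
        for (String[] e : edges) {                       // undirected: add both directions
            g.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            g.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        // Two components: one containing A, B, C, one containing D, E
        // (each also contains its attribute-value nodes).
        System.out.println(components(g));
    }
}
```

Building the adjacency map is linear in the number of (object, attribute) pairs, and the BFS visits each node and edge once, so the whole thing runs in linear time. To get just the Person groups, filter the attribute-value nodes out of each component at the end.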
while (!people.isEmpty()) {
    Person first = people.remove(0);          // seed a new set
    Set<Person> set = makeSet(first);
    Iterator<Person> it = people.iterator();
    while (it.hasNext()) {
        Person person = it.next();
        for (Person other : set) {
            if (person.isRelatedTo(other)) {
                it.remove();                  // safe removal while iterating
                set.add(person);
                break;                        // set changed; stop scanning it
            }
        }
    }
    sets.add(set);
}
boolean merged;
do {                                              // repeat until no two sets can be merged
    merged = false;
    for (int i = 0; i < sets.size(); i++) {
        Set<Person> a = sets.get(i);
        for (int j = sets.size() - 1; j > i; j--) {   // downward, so remove(j) is safe
            Set<Person> b = sets.get(j);
            search:
            for (Person person : a) {
                for (Person other : b) {
                    if (person.isRelatedTo(other)) {
                        a.addAll(b);
                        sets.remove(j);
                        merged = true;
                        break search;
                    }
                }
            }
        }
    }
} while (merged);