I have a large number of sets of numbers. Each set contains 10 numbers, and I need to remove all sets that have 5 or more numbers in common (in any order) with any other set.
Looks like you want to use the HashSet class. This should give you O(1) lookup time, which should give very efficient comparison if you get your loops right. (I'm not discussing the algorithm here, but rather simply suggesting a data structure in case it helps.)
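As a rough illustration of the data-structure suggestion, here is a minimal Java sketch that counts the numbers two sets share using HashSet lookups. The class name CommonCount and both method names are made up for this example, and it assumes the numbers are plain ints.

```java
import java.util.HashSet;
import java.util.Set;

class CommonCount {
    // Counts how many distinct values occur in both a and b.
    static int countCommon(int[] a, int[] b) {
        Set<Integer> lookup = new HashSet<>();
        for (int x : a) {
            lookup.add(x);
        }
        int common = 0;
        for (int y : b) {
            // remove() returns true only the first time y is found,
            // so a repeated value in b is not counted twice
            if (lookup.remove(y)) {
                common++;
            }
        }
        return common;
    }

    // True when the two sets share 5 or more numbers.
    static boolean tooSimilar(int[] a, int[] b) {
        return countCommon(a, b) >= 5;
    }
}
```

Each lookup is O(1) on average, so comparing two sets of 10 costs about 20 hash operations. You still have to decide which pairs to compare.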
You don't say much about the range of numbers that might appear, but I have two ideas:
An inverted list that maps each number appearing in the lists to the lists that contain it; then intersect those lists to find the ones that have more than one number in common.
Divide the numbers into ranges of "close" numbers, then refine (narrow) the lists that have numbers in those ranges. You keep reducing the ranges for matching lists until you have a manageable number of lists and can compare them exactly. This would be a "proximity" approach, I think.
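The first idea (the inverted list) could be sketched roughly as follows in Java. This is only one way to read it: the class InvertedIndex, the method similarPairs and the threshold parameter are names invented for the example, and it assumes the numbers within one set are distinct.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InvertedIndex {
    // Returns index pairs {i, j}, i < j, whose sets share at least
    // `threshold` numbers. Assumes numbers within one set are distinct.
    static List<int[]> similarPairs(int[][] sets, int threshold) {
        int n = sets.length;
        // number -> indices of the sets that contain it (in increasing order)
        Map<Integer, List<Integer>> postings = new HashMap<>();
        for (int i = 0; i < n; i++) {
            for (int value : sets[i]) {
                postings.computeIfAbsent(value, k -> new ArrayList<>()).add(i);
            }
        }
        // pair (i, j) encoded as i * n + j -> count of shared numbers
        Map<Long, Integer> shared = new HashMap<>();
        for (List<Integer> ids : postings.values()) {
            for (int a = 0; a < ids.size(); a++) {
                for (int b = a + 1; b < ids.size(); b++) {
                    int first = ids.get(a);
                    int second = ids.get(b);
                    shared.merge((long) first * n + second, 1, Integer::sum);
                }
            }
        }
        List<int[]> result = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : shared.entrySet()) {
            if (e.getValue() >= threshold) {
                long key = e.getKey();
                result.add(new int[] {(int) (key / n), (int) (key % n)});
            }
        }
        return result;
    }
}
```

How well this does depends on the range of numbers: if a few numbers appear in most sets, their lists get long and the pair counting approaches the brute-force cost.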
It's an easy problem because your sets are limited to a size of ten. For every set of ten numbers you have fewer than 1,000 subsets that contain at least five numbers. Select a hash function that hashes integer sequences into, say, 32-bit numbers. For every set of ten integers, calculate the value of this hash function for every subset with five or more elements. This gives fewer than 1,000 hash values per set of ten numbers. Add a pointer to the set of ten integers to a hash table under all of these keys. Once you have done this, your hash table has at most 1,000 * 10,000 = 10 million entries, which is completely doable; and this first pass is linear (O(n)) because the individual set size is bounded by 10.
In the next pass, iterate through all the hash values in whatever order. Whenever more than one set is associated with the same hash value, they most likely contain a common subset of at least five integers. Verify this, and then erase one of the sets and the corresponding hash table entries. Continue through the hash table. This is also an O(n) step.
Finally, suppose that you are doing this in C. Here is a routine that would calculate the hash values for a single set of ten integers. It is assumed that the integers are in ascending order:
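static int hash_index;

/* Forward declaration so that calculate_hash can call hrec. */
void hrec(int *myset, unsigned int *hash_values,
          unsigned int h, int idx, int card);

/* Fills hash_values with one hash per subset of myset that has five
   or more elements. myset must hold ten integers in ascending order. */
void calculate_hash(int *myset, unsigned int *hash_values)
{
    hash_index = 0;
    hrec(myset, hash_values, 0, 0, 0);
}

void hrec(int *myset, unsigned int *hash_values,
          unsigned int h, int idx, int card)
{
    if (idx == 10) {
        /* All ten elements decided: keep the hash if the subset is big enough. */
        if (card >= 5) {
            hash_values[hash_index++] = h;
        }
        return;
    }
    /* Mix myset[idx] into the running hash. */
    unsigned int hp = h;
    hp += (myset[idx]) + 0xf0f0f0f0;
    hp += (hp << 13) | (hp >> 19);
    hp *= 0x7777;
    hp += (hp << 13) | (hp >> 19);
    hrec(myset, hash_values, hp, idx + 1, card + 1); /* include myset[idx] */
    hrec(myset, hash_values, h, idx + 1, card);      /* exclude myset[idx] */
}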
This recurses through all the 1024 subsets and stores the hash values for subsets with cardinality 5 or more in the hash_values array. At the end, hash_index counts the number of valid entries. It is of course constant: C(10,5) + C(10,6) + ... + C(10,10) = 252 + 210 + 120 + 45 + 10 + 1 = 638, so hash_values needs room for 638 entries.
There is a way to do this with high time efficiency but extremely low space efficiency.
If my maths is correct, choosing 5 numbers from a set of 10 gives 10! / ((10-5)! 5!) = 252 combinations; multiplied by 10000 sets, that is 2.52 million combinations. A set of 5 integers will consume 20 bytes, so you could put every combination for every set into a HashSet and only use about 50 megabytes (plus overhead, which will blow it out by 2-3 times at least).
Now that might seem expensive, but the alternative is to check each new set of 10 against the existing 10000 individually. Calculating 252 sets of 5 and seeing whether any of them is already in the HashSet has to be better.
Basically:
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SetOf5 {
    private final Set<Integer> numbers;
    private final int hashCode;

    public SetOf5(int... numbers) {
        if (numbers.length != 5) {
            throw new IllegalArgumentException();
        }
        Set<Integer> set = new HashSet<Integer>();
        int hash = 19;
        for (int i : numbers) {
            set.add(i);
            // addition is commutative, so the hash does not depend on argument order
            hash = 31 * i + hash;
        }
        this.numbers = Collections.unmodifiableSet(set);
        this.hashCode = hash;
    }

    // other constructors for passing in, say, an array of 5 or a Collection of 5

    // this is precalculated because it will be called a lot
    @Override
    public int hashCode() {
        return hashCode;
    }

    @Override
    public boolean equals(Object ob) {
        if (!(ob instanceof SetOf5)) return false;
        SetOf5 setOf5 = (SetOf5) ob;
        return numbers.equals(setOf5.numbers);
    }
}
You then just have to do two things: create a HashSet<SetOf5> for all your existing tuples of 5; and write a routine that generates all possible sets of 5 from a set of 10.
Your algorithm then becomes: for each set of 10 numbers, create all possible sets of 5 and check each one to see if it's in the HashSet. If any of them is, reject the set of 10. If none is, add its sets of 5 to the "set of sets" and continue.
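The loop described above might look roughly like this in Java. To keep the sketch self-contained it uses a sorted List<Integer> as the key for each set of 5 rather than a dedicated class; FiveSubsetFilter, combinations and filter are names made up for the example, and it assumes each set holds distinct ints.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FiveSubsetFilter {
    // Appends every 5-element combination of `sorted` (kept in sorted order) to `out`.
    static void combinations(int[] sorted, int start,
                             List<Integer> current, List<List<Integer>> out) {
        if (current.size() == 5) {
            out.add(new ArrayList<>(current));
            return;
        }
        for (int i = start; i < sorted.length; i++) {
            current.add(sorted[i]);
            combinations(sorted, i + 1, current, out);
            current.remove(current.size() - 1); // remove by index: backtrack
        }
    }

    // Keeps a set only if none of its 5-combinations was seen in an earlier kept set.
    static List<int[]> filter(int[][] sets) {
        Set<List<Integer>> seen = new HashSet<>();
        List<int[]> kept = new ArrayList<>();
        for (int[] set : sets) {
            int[] sorted = set.clone();
            Arrays.sort(sorted); // sorted order makes equal combinations equal lists
            List<List<Integer>> combos = new ArrayList<>();
            combinations(sorted, 0, new ArrayList<>(), combos);
            boolean clash = false;
            for (List<Integer> combo : combos) {
                if (seen.contains(combo)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                seen.addAll(combos);
                kept.add(set);
            }
        }
        return kept;
    }
}
```

Note that this sketch only compares a new set against sets that were kept, so which sets survive depends on the order in which they are processed.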
I think you'll find that'll be an awful lot cheaper--at least in the case of 5 numbers from 10--than any brute force comparison of 10000 sets with one another.
Since you need to compare all pairs of sets, the algorithm is about O(N^2), where N is the number of sets.
Each comparison can be done in about O(X+Y), where X and Y are the sizes of the two sets, in your case 10 each, so it is constant. But this requires you to sort all the sets beforehand, which adds O(N*xlgx), where x is the size of one set; again, xlgx is constant in your case.
The linear comparison algorithm for two sets is fairly simple: as the sets are now sorted, you can iterate over both sets at the same time. See C++ std::set_intersection for details.
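For reference, the linear comparison of two sorted sets can be sketched in Java as a two-pointer walk. SortedIntersection and countCommon are names made up for the example, and both arrays are assumed to be sorted ascending with no repeated values.

```java
class SortedIntersection {
    // Counts values present in both arrays.
    // Both arrays must be sorted ascending and contain no duplicates.
    static int countCommon(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int common = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;          // a[i] cannot appear later in b
            } else if (a[i] > b[j]) {
                j++;          // b[j] cannot appear later in a
            } else {
                common++;     // match: advance both
                i++;
                j++;
            }
        }
        return common;
    }
}
```

Each step advances at least one index, so the loop runs at most X+Y times.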
The entire algorithm is then O(N^2), which would be pretty slow for 10000 sets.
You should rethink your requirements because as it is, the operation does not even have a well-defined result. For example, take these sets:
set 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
set 2: {6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
set 3: {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
If you first consider 1 and 2 to be "duplicates" and eliminate set 1, then 2 and 3 are also "duplicates" and you are left with only one remaining set. But if you instead eliminate set 2 first, then 1 and 3 have no matches and you are left with two sets remaining.
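To make the order dependence concrete, here is a small Java sketch that runs a simple elimination over the three sets above in a given order. The elimination rule (visit the sets in the given order and drop a set if it still has a "duplicate" among the remaining ones) is my own assumption about how one might implement the requirement; OrderDependence, shared and survivors are made-up names.

```java
class OrderDependence {
    // Counts shared numbers; assumes no repeated values within a set.
    static int shared(int[] a, int[] b) {
        int count = 0;
        for (int x : a) {
            for (int y : b) {
                if (x == y) {
                    count++;
                }
            }
        }
        return count;
    }

    // Visits set indices in `order` (each index exactly once) and removes
    // a set if it shares 5+ numbers with any set not yet removed.
    // Returns how many sets remain.
    static int survivors(int[][] sets, int[] order) {
        boolean[] removed = new boolean[sets.length];
        int remaining = sets.length;
        for (int idx : order) {
            for (int other = 0; other < sets.length; other++) {
                if (other != idx && !removed[other]
                        && shared(sets[idx], sets[other]) >= 5) {
                    removed[idx] = true;
                    remaining--;
                    break;
                }
            }
        }
        return remaining;
    }
}
```

With the three sets at indices 0, 1 and 2, the order {0, 1, 2} leaves one set, while {1, 0, 2} leaves two.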
You can easily expand this to your full 10,000 sets so that it would be possible that depending on which sets you compare and eliminate first, you could be left with only a single set, or with 5,000 sets. I don't think that is what you want.
Mathematically speaking, your problem is that you are trying to find equivalence classes, but the relation "similarity" you use to define them is not an equivalence relation. Specifically, it is not transitive. In layman's terms, if set A is "similar" to set B and set B is "similar" to set C, then your definition does not ensure that A is also "similar" to C, and therefore you cannot meaningfully eliminate similar sets.
You need to first clarify your requirements to deal with this problem before worrying about an efficient implementation. Either find a way to define a transitive similarity, or keep all sets and work only with comparisons (or with a list of similar sets for each single set).