Efficient algorithm for finding all maximal subsets

前端未结

关注

 4  1444

I have a collection of unique sets (represented as bit masks) and would like to eliminate all elements that are proper subsets of another element. For example:

Approach #2 - Use a bucket approach

Same assumptions. Can uniqueness be assumed? (i.e. there is not {1,4,6},{1,4,6}) Otherwise, you would need to check for distinct at some point, probably once the buckets are created.

semi psuedo

List<Set> Sets;//input
List<Set> Output;
List<List<Set>> Buckets;
int length = Sets[0].length;//"by descending lengths"
List<Set> Bucket = new List<Set>();//current bucket

//Place each set with shared length in its own bucket
for( Set set in Sets )
{
 if( set.length == length )//current Bucket
 {
  Bucket.add(set);
 }else//new Bucket
 {
  length = set.length;
  Buckets.Add(Bucket);
  Bucket = new Bucket();
  Bucket.Add(set);
 }
}
Buckets.add(Bucket);



//Based on the assumption of uniqueness, everything in the first bucket is
//larger than every other set and since it is unique, they are not proper subsets
Output.AddRange(Buckets[0]);

//Iterate through the buckets
for( int i = 1; i < Buckets.length; i++ )
{
 List<Set> currentBucket = Buckets[i];

 //Iterate through the sets in the current bucket
 for( int a = 0; a < currentBucket.length; a++ )
 {
  Set currentSet = currentBucket[a];
  bool addSet = true;
  //Iterate through buckets with greater length
  for( int b = 0; b < i; b++ )
  {
   List<Set> testBucket = Buckets[b];

   //Iterate through the sets in testBucket
   for( int c = 0; c < testBucket.length; c++ )
   {
    Set testSet = testBucket[c];
    int testMatches = 0;

    //Iterate through the values in the current set
    for( int d = 0; d < currentSet.length; d++ )
    {
     int testIndex = 0;

     //Iterate through the values in the test set
     for( ; testIndex < testSet.length; testIndex++ )
     {
      if( currentSet[d] < testSet[testIndex] )
      {
       setClear = true;
       break;
      }
      if( currentSet[d] == testSet[testIndex] )
      {
       testMatches++;
       if( testMatches == currentSet.length )
       {
        addSet = false;
        setClear = true;
        break;
       }
      }
     }//testIndex
     if( setClear ) break;
    }//d
    if( !addSet ) break;
   }//c
   if( !addSet ) break;
  }//b
  if( addSet ) Output.Add( currentSet );
 }//a
}//i

Approach #1 (`O( n(n+1)/2 )`) ... not efficient enough

semi psuedo

//input Sets
List<Set> results;
for( int current = 0; current < Sets.length; current++ )
{
 bool addCurrent = true;
 Set currentSet = Sets[current];
 for( int other = 0; other < current; other++)
 {
  Set otherSet = Sets[other];
  //is current a subset of other?
  if( currentSet.total > otherSet.total 
   || currentSet.length >= otherSet.length) continue;
  int max = currentSet.length;
  int matches = 0;
  int otherIndex = 0, len = otherSet.length;
  for( int i = 0; i < max; i++ )
  {
   for( ; otherIndex < len; otherIndex++ )
   {
     if( currentSet[i] == otherSet[otherInex] )
     {
      matches++;
      break;
     }
   }
   if( matches == max )
   {
    addCurrent = false;
    break;
   }
  }
  if( addCurrent ) results.Add(currentSet);
 }
}

This will take the set of sets, and iterate through each one. With each one, it will iterate through each set in the set again. As the nested iteration takes place, it will compare if the outer set is the same as the nested set (from the inner iteration) (if they are, no checking is done), it will also compare if the outer set has a total greater than the nested set (if the total is greater, then the outer set cannot be a proper subset), it will then compare if the outer set has a smaller amount of items than the nested set.

Once those checks are complete it begins with the first item of the outer set, and compares it with the first item of the nested set. If they are not equal, it will check the next item of the nested set. If they are equal, then it adds one to a counter, and will then compare the next item of the outer set with where it left off in the inner set.

If it reaches a point where the amount of matched comparisons equal the number of items in the outer set, then the outer set has been found to be a proper subset of the inner set. It is flagged to be excluded, and the comparisons are halted.

0 讨论(0)

不思量自难忘°

2020-12-01 05:16

This problem has been studied in literature. Given S_1,...,S_k which are subsets of {1,...,n}, Yellin [1] gave an algorithm to find the maximal subset of {S_1,...,S_k} in time O(kdm) where d is the average size of the S_i, and m is the cardinality of the the maximal subset of {S_1,...,S_k}. This was later improved for some range of parameters by Yellin and Jutla [2] to O((kd)^2/sqrt(log(kd))). It is believed that a truly sub-quadratic algorithm to this problem does not exist.

[1] Daniel M. Yellin: Algorithms for Subset Testing and Finding Maximal Sets. SODA 1992: 386-392.

[2] Daniel M. Yellin, Charanjit S. Jutla: Finding Extremal Sets in Less than Quadratic Time. Inf. Process. Lett. 48(1): 29-34 (1993).

0 讨论(0)

逝去的感伤

2020-12-01 05:21

Off the top of my head there is an O(D*N*log(N)) where D is the number of unique numbers.

The recursive function "helper" works as follows: @arguments is sets and domain (number of unique numbers in sets): Base cases:

If the domain is empty, return
If sets is empty or sets has length equal to 1, return

Iterative case:

Remove all empty sets from sets
Pick an element D in domain
Remove D from the domain
Separate sets into two sets (set1 & set2) based on whether the set contains D
Remove D from each set in sets
Set result = union ( helper(set1,domain), helper(set2,domain) )
For each set in set1 add D back
return result

Note that the runtime depends on the Set implementation used. If a doubly linked list is used to store the set then:

Steps 1-5,7 take O(N) Step 6's union is O(N*log(N)) by sorting and then merging

Therefore the overall algorithm is O(D*N*log(N))

Here is java code to perform the following

import java.util.*;

public class MyMain {

    public static Set<Set<Integer>> eliminate_subsets(Set<Set<Integer>> sets) throws Exception {
        Set<Integer> domain = new HashSet<Integer>();
        for (Set<Integer> set : sets) {
            for (Integer i : set) {
                domain.add(i);
            }
        }
        return helper(sets,domain);
    }

    public static Set<Set<Integer>> helper(Set<Set<Integer>> sets, Set<Integer> domain) throws Exception {
        if (domain.isEmpty()) { return sets; }
        if (sets.isEmpty()) { return sets; }
        else if (sets.size() == 1) { return sets; }

        sets.remove(new HashSet<Integer>());

        // Pop some value from domain
        Iterator<Integer> it = domain.iterator();
        Integer splitNum = it.next();
        it.remove();

        Set<Set<Integer>> set1 = new HashSet<Set<Integer>>(); 
        Set<Set<Integer>> set2 = new HashSet<Set<Integer>>();
        for (Set<Integer> set : sets) {
            if (set.contains(splitNum)) {
                set.remove(splitNum);
                set1.add(set);
            }
            else {
                set2.add(set);
            }
        }

        Set<Set<Integer>> ret = helper(set1,domain);
        ret.addAll(helper(set2,domain));

        for (Set<Integer> set : set1) {
            set.add(splitNum);
        }
        return ret;
    }

    /**
     * @param args
     * @throws Exception 
     */
    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        Set<Set<Integer>> s=new HashSet<Set<Integer>>();
        Set<Integer> tmp = new HashSet<Integer>();
        tmp.add(new Integer(1)); tmp.add(new Integer(2)); tmp.add(new Integer(3));
        s.add(tmp);

        tmp = new HashSet<Integer>();
        tmp.add(new Integer(1)); tmp.add(new Integer(2));
        s.add(tmp);

        tmp = new HashSet<Integer>();
        tmp.add(new Integer(3)); tmp.add(new Integer(4));
        s.add(tmp);
        System.out.println(eliminate_subsets(s).toString());
    }


}

*New years is disruptive

0 讨论(0)

Efficient algorithm for finding all maximal subsets

Approach #2 - Use a bucket approach

Approach #1 (O( n(n+1)/2 )) ... not efficient enough

Approach #1 (`O( n(n+1)/2 )`) ... not efficient enough