number to unique permutation mapping of a sequence containing duplicates

后端 未结 5 1128
野趣味
野趣味 2021-02-02 00:26

I am looking for an algorithm that can map a number to a unique permutation of a sequence. I have found out about Lehmer codes and the factorial number system thanks to a simila

相关标签:
5条回答
  • 2021-02-02 01:10

    From Permutation to Number:

    Let K be the number of character classes (example: AAABBC has three character classes)

    Let N[K] be the number of elements in each character class. (example: for AAABBC, we have N[K]=[3,2,1], and let N= sum(N[K])

    Every legal permutation of the sequence then uniquely corresponds to a path in an incomplete K-way tree.

    The unique number of the permutation then corresponds to the index of the tree-node in a post-order traversal of the K-ary tree terminal nodes.

    Luckily, we don't actually have to perform the tree traversal -- we just need to know how many terminal nodes in the tree are lexicographically less than our node. This is very easy to compute, as at any node in the tree, the number terminal nodes below the current node is equal to the number of permutations using the unused elements in the sequence, which has a closed form solution that is a simple multiplication of factorials.

    So given our 6 original letters, and the first element of our permutation is a 'B', we determine that there will be 5!/3!1!1! = 20 elements that started with 'A', so our permutation number has to be greater than 20. Had our first letter been a 'C', we could have calculated it as 5!/2!2!1! (not A) + 5!/3!1!1! (not B) = 30+ 20, or alternatively as 60 (total) - 5!/3!2!0! (C) = 50

    Using this, we can take a permutation (e.g. 'BAABCA') and perform the following computations: Permuation #= (5!/2!2!1!) ('B') + 0('A') + 0('A')+ 3!/1!1!1! ('B') + 2!/1!

    = 30 + 3 +2 = 35

    Checking that this works: CBBAAA corresponds to

    (5!/2!2!1! (not A) + 5!/3!1!1! (not B)) 'C'+ 4!/2!2!0! (not A) 'B' + 3!/2!1!0! (not A) 'B' = (30 + 20) +6 + 3 = 59

    Likewise, AAABBC = 0 ('A') + 0 'A' + '0' A' + 0 'B' + 0 'B' + 0 'C = 0

    Sample implementation:

    import math
    import copy
    from operator import mul
    
    def computePermutationNumber(inPerm, inCharClasses):
        permutation=copy.copy(inPerm)
        charClasses=copy.copy(inCharClasses)
    
        n=len(permutation)
        permNumber=0
        for i,x in enumerate(permutation):
            for j in xrange(x):
                if( charClasses[j]>0):
                    charClasses[j]-=1
                    permNumber+=multiFactorial(n-i-1, charClasses)
                    charClasses[j]+=1
            if charClasses[x]>0:
                charClasses[x]-=1
        return permNumber
    
    def multiFactorial(n, charClasses):
        val= math.factorial(n)/ reduce(mul, (map(lambda x: math.factorial(x), charClasses)))
        return val
    

    From Number to Permutation: This process can be done in reverse, though I'm not sure how efficiently: Given a permutation number, and the alphabet that it was generated from, recursively subtract the largest number of nodes less than or equal to the remaining permutation number.

    E.g. Given a permutation number of 59, we first can subtract 30 + 20 = 50 ('C') leaving 9. Then we can subtract 'B' (6) and a second 'B'(3), re-generating our original permutation.

    0 讨论(0)
  • 2021-02-02 01:10

    A very simple algorithm to mapping a number for a permutation consists of n digits is

    number<-digit[0]*10^(n-1)+digit[1]*10^(n-2)+...+digit[n]*10^0
    

    You can find plenty of resources for algorithms to generate permutations. I guess you want to use this algorithm in bioinformatics. For example you can use itertools.permutations from Python.

    0 讨论(0)
  • 2021-02-02 01:19

    Assuming the resulting number fits inside a word (e.g. 32 or 64 bit integer) relatively easily, then much of the linked article still applies. Encoding and decoding from a variable base remains the same. What changes is how the base varies.

    If you're creating a permutation of a sequence, you pick an item out of your bucket of symbols (from the original sequence) and put it at the start. Then you pick out another item from your bucket of symbols and put it on the end of that. You'll keep picking and placing symbols at the end until you've run out of symbols in your bucket.

    What's significant is which item you picked out of the bucket of the remaining symbols each time. The number of remaining symbols is something you don't have to record because you can compute that as you build the permutation -- that's a result of your choices, not the choices themselves.

    The strategy here is to record what you chose, and then present an array of what's left to be chosen. Then choose, record which index you chose (packing it via the variable base method), and repeat until there's nothing left to choose. (Just as above when you were building a permuted sequence.)

    In the case of duplicate symbols it doesn't matter which one you picked, so you can treat them as the same symbol. The difference is that when you pick a symbol which still has a duplicate left, you didn't reduce the number of symbols in the bucket to pick from next time.

    Let's adopt a notation that makes this clear:

    Instead of listing duplicate symbols left in our bucket to choose from like c a b c a a we'll list them along with how many are still in the bucket: c-2 a-3 b-1.

    Note that if you pick c from the list, the bucket has c-1 a-3 b-1 left in it. That means next time we pick something, we have three choices.

    But on the other hand, if I picked b from the list, the bucket has c-2 a-3 left in it. That means next time we pick something, we only have two choices.

    When reconstructing the permuted sequence we just maintain the bucket the same way as when we were computing the permutation number.

    The implementation details aren't trivial, but they're straightforward with standard algorithms. The only thing that might heckle you is what to do when a symbol in your bucket is no longer available.

    Suppose your bucket was represented by a list of pairs (like above): c-1 a-3 b-1 and you choose c. Your resulting bucket is c-0 a-3 b-1. But c-0 is no longer a choice, so your list should only have two entries, not three. You could move the entire list down by 1 resulting in a-3 b-1, but if your list is long this is expensive. A fast an easy solution: move the last element of the bucket into the removed location and decrease your bucket size: c0 a-3 b-1 becomes b-1 a-3 <empty> or just b-1 a-3.

    Note that we can do the above because it doesn't matter what order the symbols in the bucket are listed in, as long as it's the same way when we encode or decode the number.

    0 讨论(0)
  • 2021-02-02 01:28

    As I was unsure of the code in gbronner's answer (or of my understanding), I recoded it in R as follows

    ritpermz=function(n, parclass){
        return(factorial(n) / prod(factorial(parclass)))}
    
    rankum <- function(confg, parclass){
        n=length(confg)
        permdex=1
        for (i in 1:(n-1)){
          x=confg[i]
          if (x > 1){
            for (j in 1:(x-1)){
                if(parclass[j] > 0){
                    parclass[j]=parclass[j]-1
                    permdex=permdex + ritpermz(n-i, parclass)
                    parclass[j]=parclass[j]+1}}}
            parclass[x]=parclass[x]-1
        }#}
        return(permdex)
    }
    

    which does produce a ranking with the right range of integers

    0 讨论(0)
  • 2021-02-02 01:31

    Here is an algorithm in Java that enumerates the possible sequences by mapping an integer to the sequence.

    public class Main {
    
        private int[] counts = { 3, 2, 1 }; // 3 Symbols A, 2 Symbols B, 1 Symbol C
        private int n = sum(counts);
    
        public static void main(String[] args) {
            new Main().enumerate();
        }
    
        private void enumerate() {
            int s = size(counts);
            for (int i = 0; i < s; ++i) {
                String p = perm(i);
                System.out.printf("%4d -> %s\n", i, p);
            }
    
        }
    
        // calculates the total number of symbols still to be placed
        private int sum(int[] counts) {
            int n = 0;
            for (int i = 0; i < counts.length; i++) {
                n += counts[i];
            }
            return n;
        }
    
        // calculates the number of different sequences with the symbol configuration in counts
        private int size(int[] counts) {
            int res = 1;
            int num = 0;
            for (int pos = 0; pos < counts.length; pos++) {
                for (int den = 1; den <= counts[pos]; den++) {
                    res *= ++num;
                    res /= den;
                }
            }
            return res;
        }
    
        // maps the sequence number to a sequence
        private String perm(int num) {
            int[] counts = this.counts.clone();
            StringBuilder sb = new StringBuilder(n);
            for (int i = 0; i < n; ++i) {
                int p = 0;
                for (;;) {
                    while (counts[p] == 0) {
                        p++;
                    }
                    counts[p]--;
                    int c = size(counts);
                    if (c > num) {
                        sb.append((char) ('A' + p));
                        break;
                    }
                    counts[p]++;
                    num -= c;
                    p++;
                }
            }
            return sb.toString();
        }
    
    }
    

    The mapping used by the algorithm is as follows. I use the example given in the question (3 x A, 2 x B, 1 x C) to illustrate it.

    There are 60 (=6!/3!/2!/1!) possible sequences in total, 30 (=5!/2!/2!/1!) of them have an A at the first place, 20 (=5!/3!/1!/1!) have a B at the first place, and 10 (=5!/3!/2!/0!) have a C at the first place.

    The numbers 0..29 are mapped to all sequences starting with an A, 30..49 are mapped to the sequences starting with B, and 50..59 are mapped to the sequences starting with C.

    The same process is repeated for the next place in the sequence, for example if we take the sequences starting with B we have now to map numbers 0 (=30-30) .. 19 (=49-30) to the sequences with configuration (3 x A, 1 x B, 1 x C)

    0 讨论(0)
提交回复
热议问题