How can I effectively encode/decode a compressed position description?


I have found a more elegant solution for up to 16 positions, using 64-bit integers and a single loop each for encoding and decoding:

#include <stdio.h>
#include <stdlib.h>

void encode16(int dest[], int src[], int n) {
    unsigned long long state = 0xfedcba9876543210ULL; /* nibble p holds the current code of square p */
    for (int i = 0; i < n; i++) {
        int p4 = src[i] * 4;
        dest[i] = (state >> p4) & 15;                  /* extract the code of square src[i] */
        state -= 0x1111111111111110ULL << p4;          /* decrement the codes of all higher squares */
    }
}

void decode16(int dest[], int src[], int n) {
    unsigned long long state = 0xfedcba9876543210ULL; /* nibble j holds the j-th remaining square */
    for (int i = 0; i < n; i++) {
        int p4 = src[i] * 4;
        dest[i] = (state >> p4) & 15;                  /* the src[i]-th remaining square */
        unsigned long long mask = ((unsigned long long)1 << p4) - 1;
        state = (state & mask) | ((state >> 4) & ~mask); /* delete that nibble, shifting higher ones down */
    }
}

int main(int argc, char *argv[]) {
    int naive[argc], compact[argc];
    int n = argc - 1;

    for (int i = 0; i < n; i++) {
        naive[i] = atoi(argv[i + 1]);
    }

    encode16(compact, naive, n);
    for (int i = 0; i < n; i++) {
        printf("%d ", compact[i]);
    }
    printf("\n");

    decode16(naive, compact, n);
    for (int i = 0; i < n; i++) {
        printf("%d ", naive[i]);
    }
    printf("\n");
    return 0;
}

The code uses a 64-bit unsigned integer to hold an array of 16 values in the range 0..15. Such an array can be updated in parallel in a single step; extracting a value is straightforward, and deleting a value is a bit more cumbersome but still takes only a few steps.

You could extend this method to 25 positions using non-portable 128-bit integers (type __int128 is supported by both gcc and clang), encoding each position on 5 bits and taking advantage of the fact that 5 * 25 < 128, but the magic constants are more cumbersome to write.
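As a sketch of that extension (untested, assuming gcc/clang's unsigned __int128; encode25 is a hypothetical name, not part of the code above), the constants can be built at run time instead of being spelled out as 125-bit literals:

typedef unsigned __int128 u128;

void encode25(int dest[], const int src[], int n) {
    u128 state = 0, dec = 0;
    for (int i = 24; i >= 0; i--)
        state = (state << 5) | (unsigned)i;  /* field p holds the code of square p */
    for (int i = 0; i < 24; i++)
        dec = (dec << 5) | 1;
    dec <<= 5;                               /* a 1 in every 5-bit field except field 0 */
    for (int i = 0; i < n; i++) {
        int p5 = src[i] * 5;
        dest[i] = (int)((state >> p5) & 31); /* extract the code of square src[i] */
        state -= dec << p5;                  /* decrement the codes of all higher squares */
    }
}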

The naive solution to the problem: create an array where the values are initially equal to the indexes. When you use a square, take its value from the array, and decrement all the values to the right. The running time of this solution is O(n*p) where n is the number of squares on the board and p is the number of pieces on the board.

int codes[25];

void initCodes( void )
{
    for ( int i = 0; i < 25; i++ )
        codes[i] = i;
}

int getCodeForLocation( int location )
{
    for ( int i = location + 1; i < 25; i++ )
        codes[i]--;
    return codes[location];
}

You can attempt to improve the performance of this code with binning. Consider the locations on the board as 5 bins of 5 locations each. Each bin has an offset and each location in a bin has a value. When a value is taken from bin y at location x, the offsets for all bins below y are decremented, and all values to the right of x in bin y are decremented.

int codes[5][5];
int offset[5];

void initCodes( void )
{
    int code = 0;
    for ( int row = 0; row < 5; row++ )
    {
        for ( int col = 0; col < 5; col++ )
            codes[row][col] = code++;
        offset[row] = 0;
    }
}

int getCodeForLocation( int location )
{
    int startRow = location / 5;
    int startCol = location % 5;
    for ( int col = startCol+1; col < 5; col++ )
        codes[startRow][col]--;
    for ( int row = startRow+1; row < 5; row++ )
        offset[row]--;
    return codes[startRow][startCol] + offset[startRow];
}

The running time of this solution is O(sqrt(n) * p). However, on a board with 25 squares, you won't see much improvement. To see why, consider the actual operations done by the naive solution versus the binned solution. Worst case, the naive solution updates 24 locations. Worst case, the binned solution updates 4 entries in the offset array and 4 locations in the codes array. So that seems like a 3:1 speedup. However, the binned code contains a nasty division/modulo instruction and is more complicated overall. So you might get a 2:1 speedup if you're lucky.

If the board size was huge, e.g. 256x256, then binning would be great. The worst case for the naive solution would be 65535 entries, whereas binning would update a maximum of 255+255=510 array entries. So that would definitely make up for the nasty division and increased code complexity.

And therein lies the futility of trying to optimize small problem sets. You don't save much changing O(n) to O(sqrt(n)) or O(log(n)) when you have n=25, sqrt(n)=5, and log(n)≈5. You get a theoretical speedup, but that's almost always a false savings when you consider the myriad constant factors that big-O so blithely ignores.


For completeness, here's the driver code that can be used with either snippet above

#include <stdio.h>

int main( void )
{
    int locations[6] = { 5,2,3,0,7,4 };
    initCodes();
    for ( int i = 0; i < 6; i++ )
        printf( "%d ", getCodeForLocation(locations[i]) );
    printf( "\n" );
}

Output: 5 2 2 0 3 1

Your encoding technique has the property that the value of each element of the output tuple depends on the values of the corresponding element and all preceding elements of the input tuple. I don't see a way to accumulate partial results during the computation of one encoded element that could be reused in the computation of a different one, and without that, no computation of the encoding can scale better than o(n²) in the number of elements to be encoded. Therefore, for the problem size you describe, I don't think you can do much better than this:

typedef int element_t;   /* or any integer type of your choice */

void encode(element_t in[], element_t out[], int num_elements) {
    for (int p = 0; p < num_elements; p++) {
        element_t temp = in[p];

        for (int i = 0; i < p; i++) {
            temp -= (in[i] < in[p]);  /* subtract one for each preceding value below in[p] */
        }

        out[p] = temp;
    }
}

The corresponding decoding could be done like this:

void decode(element_t in[], element_t out[], int num_elements) {
    for (int p = 0; p < num_elements; p++) {
        element_t temp = in[p];

        for (int i = p - 1; i >= 0; i--) {
            temp += (in[i] <= temp);  /* bump temp past earlier element i if needed */
        }

        out[p] = temp;
    }
}

There are approaches that scale better, some of them discussed in comments and in other answers, but my best guess is that your problem size is not large enough for their improved scaling to overcome their increased overhead.

Obviously, these transformations do not themselves change the size of the representation at all. The encoded representation is easier to validate, however, because each position in a tuple can be validated independently of the others. For that reason, the whole space of valid tuples can also be enumerated much more efficiently in the encoded form than in the decoded form.
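As an illustration of that enumeration, here is a minimal sketch (next_tuple is a hypothetical helper, not part of the code above): under the encoding above, the element at tuple index i is valid exactly when it is less than board_size − i, so we can step through all valid encoded tuples like an odometer.

int next_tuple(element_t t[], int num_elements, int board_size) {
    for (int i = 0; i < num_elements; i++) {
        if (++t[i] < board_size - i)
            return 1;   /* no carry needed: t now holds the next valid tuple */
        t[i] = 0;       /* this digit wrapped around: carry into the next one */
    }
    return 0;           /* all digits wrapped: the enumeration is complete */
}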

I continue to maintain that the decoded form can be stored almost as efficiently as the encoded form, especially if you want to be able to address individual position descriptions. If your objective for the encoded form is to support bulk enumeration, then you could consider enumerating tuples in the "encoded" form, but storing and subsequently using them in the decoded form. The small amount of extra space needed might very well be worth it for the benefit of not needing to perform the decoding after reading, especially if you plan to read a lot of these.


Update:

In response to your comment, the elephant in the room is how you convert the encoded form to a single index of the kind you describe, with as few unused indices as possible. I think that is the disconnect that spawned so much of the discussion you considered off-topic, and I presume some assumption about it feeds into your assertion of a 24x space savings.

The encoded form is more easily converted to a compact index. For example, you can treat the position as a little-endian number with the board size as its radix:

#define BOARD_SIZE 25
typedef unsigned long long index_t;   /* any type big enough for the full index range */

index_t to_index(element_t in[], int num_elements) {
    // The leading digit must not be zero
    index_t result = in[num_elements - 1] + 1;

    for (int i = num_elements - 1; i--; ) {
        result = result * BOARD_SIZE + in[i];
    }
    return result;
}

There are still gaps in that, to be sure, but I estimate them to constitute a reasonably small proportion of the overall range of index values used (and arranging for that to be so is the reason for taking a little-endian interpretation). I leave the reverse transformation as an exercise :).
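For readers who want to check their work, the reverse transformation might look like the following sketch (from_index is a hypothetical name; it assumes the to_index above):

void from_index(index_t index, element_t out[], int num_elements) {
    /* peel off base-BOARD_SIZE digits in little-endian order */
    for (int i = 0; i < num_elements - 1; i++) {
        out[i] = index % BOARD_SIZE;
        index /= BOARD_SIZE;
    }
    out[num_elements - 1] = index - 1;  /* undo the +1 on the leading digit */
}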

To convert from naive to compact position, you can iterate over the n-tuple and perform these steps for each position p:

  1. optionally check that position p is available
  2. set position p as busy
  3. subtract from p the number of lower positions that are busy
  4. store the result into the destination n-tuple

You can do this by maintaining an array of n bits for the busyness state:

  • steps 1, 2 and 4 take constant time
  • step 3 can be computed efficiently if the array is small, e.g. if it fits in 64 bits.

Here is an implementation:

#include <stdio.h>
#include <stdlib.h>

/* version for up to 9 positions */
#define BC9(n)  ((((n)>>0)&1) + (((n)>>1)&1) + (((n)>>2)&1) + \
                 (((n)>>3)&1) + (((n)>>4)&1) + (((n)>>5)&1) + \
                 (((n)>>6)&1) + (((n)>>7)&1) + (((n)>>8)&1))
#define x4(m,n)    m(n), m((n)+1), m((n)+2), m((n)+3)
#define x16(m,n)   x4(m,n), x4(m,(n)+4), x4(m,(n)+8), x4(m,(n)+12)
#define x64(m,n)   x16(m,n), x16(m,(n)+16), x16(m,(n)+32), x16(m,(n)+48)
#define x256(m,n)  x64(m,n), x64(m,(n)+64), x64(m,(n)+128), x64(m,(n)+192)

static int const bc512[1 << 9] = {
    x256(BC9, 0),
    x256(BC9, 256),
};

int encode9(int dest[], int src[], int n) {
    unsigned int busy = 0;
    for (int i = 0; i < n; i++) {
        int p = src[i];
        unsigned int bit = 1 << p;
        //if (busy & bit) return 1;  // optional validity check
        busy |= bit;
        dest[i] = p - bc512[busy & (bit - 1)];
    }
    return 0;
}

/* version for up to 64 positions */
static inline int bitcount64(unsigned long long m) {
    m = m - ((m >> 1) & 0x5555555555555555);
    m = (m & 0x3333333333333333) + ((m >> 2) & 0x3333333333333333);
    m = (m + (m >> 4)) & 0x0f0f0f0f0f0f0f0f;
    m = m + (m >> 8);
    m = m + (m >> 16);
    m = m + (m >> 16 >> 16);
    return m & 0x3f;
}

int encode64(int dest[], int src[], int n) {
    unsigned long long busy = 0;
    for (int i = 0; i < n; i++) {
        int p = src[i];
        unsigned long long bit = 1ULL << p;
        //if (busy & bit) return 1;  // optional validity check
        busy |= bit;
        dest[i] = p - bitcount64(busy & (bit - 1));
    }
    return 0;
}

int main(int argc, char *argv[]) {
    int src[argc], dest[argc];
    int cur, max = 0, n = argc - 1;

    for (int i = 0; i < n; i++) {
        src[i] = cur = atoi(argv[i + 1]);
        if (max < cur)
            max = cur;
    }
    if (max < 9) {
        encode9(dest, src, n);
    } else {
        encode64(dest, src, n);
    }
    for (int i = 0; i < n; i++) {
        printf("%d ", dest[i]);
    }
    printf("\n");
    return 0;
}

The core optimisation is in the implementation of bitcount(), which you can tailor to your needs by specializing it to the actual number of positions. I posted efficient solutions above for small numbers up to 9 and large numbers up to 64, but you can craft a more efficient solution for 12 or 32 positions, as sketched below.
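For example, a 32-position variant could use the classic 32-bit SWAR population count (a sketch; bitcount32 is a name of my choosing, by analogy with the bitcount64 above):

static inline int bitcount32(unsigned int m) {
    m = m - ((m >> 1) & 0x55555555);                /* 2-bit sums */
    m = (m & 0x33333333) + ((m >> 2) & 0x33333333); /* 4-bit sums */
    m = (m + (m >> 4)) & 0x0f0f0f0f;                /* 8-bit sums */
    return (int)((m * 0x01010101) >> 24);           /* add the four byte sums */
}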

In terms of time complexity, in the general case we still have O(n²), but for small values of n it effectively runs in O(n·log(n)) or better, since the parallel implementation of bitcount() reduces to log(n) steps or fewer for n up to 64.

You can look at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive for inspiration and amazement.

Unfortunately, I'm still looking for ways to use this or a similar trick for decoding...


In this answer, I want to show some of my own ideas for implementing the conversions as well as some benchmarking results.

You can find the code on GitHub. These are the results on my main machine:

algorithm   ------ total time -------  ---------- per call ------------
            decoding encoding total    decoding   encoding   total
baseline    0.0391s  0.0312s  0.0703s    3.9062ns   3.1250ns   7.0312ns
count       1.5312s  1.4453s  2.9766s  153.1250ns 144.5312ns 297.6562ns
bitcount    1.5078s  0.0703s  1.5781s  150.7812ns   7.0312ns 157.8125ns
decrement   2.1875s  1.7969s  3.9844s  218.7500ns 179.6875ns 398.4375ns
bin4        2.1562s  1.7734s  3.9297s  215.6250ns 177.3438ns 392.9688ns
bin5        2.0703s  1.8281s  3.8984s  207.0312ns 182.8125ns 389.8438ns
bin8        2.0547s  1.8672s  3.9219s  205.4688ns 186.7188ns 392.1875ns
vector      0.3594s  0.2891s  0.6484s   35.9375ns  28.9062ns  64.8438ns
shuffle     0.1328s  0.3438s  0.4766s   13.2812ns  34.3750ns  47.6562ns
tree        2.0781s  1.7734s  3.8516s  207.8125ns 177.3438ns 385.1562ns
treeasm     1.4297s  0.7422s  2.1719s  142.9688ns  74.2188ns 217.1875ns
bmi2        0.0938s  0.0703s  0.1641s    9.3750ns   7.0312ns  16.4062ns

Implementations

  • baseline is an implementation that does nothing except reading the input. Its purpose is to measure function-call and memory-access overhead.
  • count is a “naïve” implementation that stores an occupancy map indicating which squares already have pieces on them.
  • bitcount is the same thing but with the occupancy map stored as a bitmap. __builtin_popcount is used for encoding, speeding things up considerably. If one uses a hand-written popcount instead, bitcount is still the fastest portable implementation of encoding.
  • decrement is the second naïve implementation. It stores the encoding for each square of the board; after adding a piece, all square numbers to the right are decremented.
  • bin4, bin5, and bin8 use binning with bins sized 4, 5, and 8 entries as suggested by user3386109.
  • shuffle computes a slightly different encoding based on the Fisher–Yates shuffle. It works by reconstructing the random values that would have gone into a shuffle generating the permutation we want to encode (a sketch of this idea follows the list). The code is branchless and fast, in particular when decoding.
  • vector uses a vector of five-bit numbers as suggested by chqrlie.
  • tree uses a difference tree, a data structure I made up. It's a full binary tree of depth ⌈log₂ n⌉ where the leaves represent the squares and the inner nodes on the path to each leaf sum to the code of that square (only the nodes where you descend to the right are added). The square numbers themselves are not stored, so only n − 1 words of extra memory are needed.

    With this data structure, we can compute the code for each square in ⌈log₂ n⌉ − 1 steps and mark a square as occupied in the same number of steps. The inner loop is very simple, comprising a branch and two actions depending on whether you descend to the left or to the right. On ARM, this branch compiles to a few conditional instructions, leading to a very fast implementation. On x86, neither gcc nor clang is smart enough to get rid of the branches.

  • treeasm is a variant of tree that uses inline assembly to implement the inner loop of tree without branches by carefully manipulating the carry flag.
  • bmi2 uses the pdep and pext instructions from the BMI2 instruction set to implement the algorithm in a very fast manner.
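
Here is the sketch referenced in the shuffle bullet above: one way to recover the Fisher–Yates swap indices that would produce a given permutation is to simulate the shuffle forward while tracking the inverse permutation, so each lookup takes O(1). This is my reconstruction of the idea, not fuz's actual code (which lives in the linked GitHub repository); encode_shuffle is a hypothetical name and n is assumed to be at most 64.

void encode_shuffle(int code[], const int p[], int n) {
    int a[64], inv[64];            /* current array and its inverse; assumes n <= 64 */
    for (int i = 0; i < n; i++)
        a[i] = inv[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = inv[p[i]];         /* p[i] must be swapped into place from position j */
        code[i] = j;               /* the "random" value the shuffle would have drawn */
        a[j] = a[i];               /* perform the swap ...                            */
        inv[a[j]] = j;             /* ... and keep the inverse consistent             */
        a[i] = p[i];
        inv[p[i]] = i;
    }
    code[0] = 0;                   /* the last choice is always forced */
}

Decoding is then a plain Fisher–Yates pass: start from the identity array and swap a[i] with a[code[i]] for i from n − 1 down to 1.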

For my actual project, I'm probably going to use the shuffle implementation since it is the fastest one that does not depend on any non-portable extensions (such as Intel intrinsics) or implementation details (such as the availability of 128-bit integers).

To go from (5, 2, 3, 0, 7, 4) to (5, 2, 2, 0, 3, 1) you just have to:

  • start with (5, 2, 3, 0, 7, 4) and push 5 into the result: (5)
  • take 2 and count the preceding values less than 2, which is 0, then push 2-0: (5, 2)
  • take 3, count the preceding values less than 3, which is 1, then push 3-1: (5, 2, 2)
  • take 0, count the preceding values less than 0, which is 0, then push 0-0: (5, 2, 2, 0)
  • take 7, count..., 4, then push 7-4: (5, 2, 2, 0, 3)
  • take 4, count..., 3, then push 4-3: (5, 2, 2, 0, 3, 1)