How can I effectively encode/decode a compressed position description?


I have found a more elegant solution for up to 16 positions, using 64-bit integers and a single loop each for encoding and decoding:

#include <stdio.h>
#include <stdlib.h>

void encode16(int dest[], int src[], int n) {
    unsigned long long state = 0xfedcba9876543210ULL; /* nibble p holds the current code of square p */
    for (int i = 0; i < n; i++) {
        int p4 = src[i] * 4;
        dest[i] = (state >> p4) & 15;                  /* extract the code of square src[i] */
        state -= 0x1111111111111110ULL << p4;          /* decrement the codes of all higher squares */
    }
}

void decode16(int dest[], int src[], int n) {
    unsigned long long state = 0xfedcba9876543210ULL; /* nibble j holds the j-th remaining square */
    for (int i = 0; i < n; i++) {
        int p4 = src[i] * 4;
        dest[i] = (state >> p4) & 15;                  /* the src[i]-th remaining square */
        unsigned long long mask = ((unsigned long long)1 << p4) - 1;
        state = (state & mask) | ((state >> 4) & ~mask); /* delete that nibble, shifting higher ones down */
    }
}

int main(int argc, char *argv[]) {
    int naive[argc], compact[argc];
    int n = argc - 1;

    for (int i = 0; i < n; i++) {
        naive[i] = atoi(argv[i + 1]);
    }

    encode16(compact, naive, n);
    for (int i = 0; i < n; i++) {
        printf("%d ", compact[i]);
    }
    printf("\n");

    decode16(naive, compact, n);
    for (int i = 0; i < n; i++) {
        printf("%d ", naive[i]);
    }
    printf("\n");
    return 0;
}

The code uses a 64-bit unsigned integer to hold an array of 16 values in the range 0..15. Such an array can be updated in parallel in a single step; extracting a value is straightforward, and deleting a value is a bit more cumbersome but still takes only a few steps.

You could extend this method to 25 positions using non-portable 128-bit integers (type __int128 is supported by both gcc and clang), encoding each position on 5 bits and taking advantage of the fact that 5 * 25 < 128, but the magic constants are more cumbersome to write.
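As a sketch of that extension (untested, assuming gcc/clang's unsigned __int128; encode25 is a hypothetical name, not part of the code above), the constants can be built at run time instead of being spelled out as 125-bit literals:

typedef unsigned __int128 u128;

void encode25(int dest[], const int src[], int n) {
    u128 state = 0, dec = 0;
    for (int i = 24; i >= 0; i--)
        state = (state << 5) | (unsigned)i;  /* field p holds the code of square p */
    for (int i = 0; i < 24; i++)
        dec = (dec << 5) | 1;
    dec <<= 5;                               /* a 1 in every 5-bit field except field 0 */
    for (int i = 0; i < n; i++) {
        int p5 = src[i] * 5;
        dest[i] = (int)((state >> p5) & 31); /* extract the code of square src[i] */
        state -= dec << p5;                  /* decrement the codes of all higher squares */
    }
}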

The naive solution to the problem: create an array where the values are initially equal to the indexes. When you use a square, take its value from the array, and decrement all the values to the right. The running time of this solution is O(n*p) where n is the number of squares on the board and p is the number of pieces on the board.

int codes[25];

void initCodes( void )
{
    for ( int i = 0; i < 25; i++ )
        codes[i] = i;
}

int getCodeForLocation( int location )
{
    for ( int i = location + 1; i < 25; i++ )
        codes[i]--;
    return codes[location];
}

You can attempt to improve the performance of this code with binning. Consider the locations on the board as 5 bins of 5 locations each. Each bin has an offset and each location in a bin has a value. When a value is taken from bin y at location x, the offsets for all bins below y are decremented, and all values to the right of x in bin y are decremented.

int codes[5][5];
int offset[5];

void initCodes( void )
{
    int code = 0;
    for ( int row = 0; row < 5; row++ )
    {
        for ( int col = 0; col < 5; col++ )
            codes[row][col] = code++;
        offset[row] = 0;
    }
}

int getCodeForLocation( int location )
{
    int startRow = location / 5;
    int startCol = location % 5;
    for ( int col = startCol+1; col < 5; col++ )
        codes[startRow][col]--;
    for ( int row = startRow+1; row < 5; row++ )
        offset[row]--;
    return codes[startRow][startCol] + offset[startRow];
}

The running time of this solution is O(sqrt(n) * p). However, on a board with 25 squares, you won't see much improvement. To see why, consider the actual operations done by the naive solution versus the binned solution. Worst case, the naive solution updates 24 locations. Worst case, the binned solution updates 4 entries in the offset array and 4 locations in the codes array. So that seems like a 3:1 speedup. However, the binned code contains a nasty division/modulo instruction and is more complicated overall. So you might get a 2:1 speedup if you're lucky.

If the board size was huge, e.g. 256x256, then binning would be great. The worst case for the naive solution would be 65535 entries, whereas binning would update a maximum of 255+255=510 array entries. So that would definitely make up for the nasty division and increased code complexity.

And therein lies the futility of trying to optimize small problem sets. You don't save much changing O(n) to O(sqrt(n)) or O(log(n)) when you have n=25, sqrt(n)=5, and log(n)≈5. You get a theoretical speedup, but that's almost always a false savings when you consider the myriad constant factors that big-O so blithely ignores.


For completeness, here's the driver code that can be used with either snippet above

#include <stdio.h>

int main( void )
{
    int locations[6] = { 5,2,3,0,7,4 };
    initCodes();
    for ( int i = 0; i < 6; i++ )
        printf( "%d ", getCodeForLocation(locations[i]) );
    printf( "\n" );
}

Output: 5 2 2 0 3 1

Your encoding technique has the property that the value of each element of the output tuple depends on the values of the corresponding element and all preceding elements of the input tuple. I don't see a way to accumulate partial results during the computation of one encoded element that could be reused in the computation of a different one, and without that, no computation of the encoding can scale better than o(n²) in the number of elements to be encoded. Therefore, for the problem size you describe, I don't think you can do much better than this:

typedef int element_t;   /* or any integer type of your choice */

void encode(element_t in[], element_t out[], int num_elements) {
    for (int p = 0; p < num_elements; p++) {
        element_t temp = in[p];

        for (int i = 0; i < p; i++) {
            temp -= (in[i] < in[p]);  /* subtract one for each preceding value below in[p] */
        }

        out[p] = temp;
    }
}

The corresponding decoding could be done like this:

void decode(element_t in[], element_t out[], int num_elements) {
    for (int p = 0; p < num_elements; p++) {
        element_t temp = in[p];

        for (int i = p - 1; i >= 0; i--) {
            temp += (in[i] <= temp);  /* bump temp past earlier element i if needed */
        }

        out[p] = temp;
    }
}

There are approaches that scale better, some of them discussed in comments and in other answers, but my best guess is that your problem size is not large enough for their improved scaling to overcome their increased overhead.

Obviously, these transformations do not themselves change the size of the representation at all. The encoded representation is easier to validate, however, because each position in a tuple can be validated independently of the others. For that reason, the whole space of valid tuples can also be enumerated much more efficiently in the encoded form than in the decoded form.
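As an illustration of that enumeration, here is a minimal sketch (next_tuple is a hypothetical helper, not part of the code above): under the encoding above, the element at tuple index i is valid exactly when it is less than board_size − i, so we can step through all valid encoded tuples like an odometer.

int next_tuple(element_t t[], int num_elements, int board_size) {
    for (int i = 0; i < num_elements; i++) {
        if (++t[i] < board_size - i)
            return 1;   /* no carry needed: t now holds the next valid tuple */
        t[i] = 0;       /* this digit wrapped around: carry into the next one */
    }
    return 0;           /* all digits wrapped: the enumeration is complete */
}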

I continue to maintain that the decoded form can be stored almost as efficiently as the encoded form, especially if you want to be able to address individual position descriptions. If your objective for the encoded form is to support bulk enumeration, then you could consider enumerating tuples in the "encoded" form, but storing and subsequently using them in the decoded form. The small amount of extra space needed might very well be worth it for the benefit of not needing to perform the decoding after reading, especially if you plan to read a lot of these.


Update:

In response to your comment, the elephant in the room is how you convert the encoded form to a single index of the kind you describe, with as few unused indices as possible. I think that is the disconnect that spawned so much of the discussion you considered off-topic, and I presume some assumption about it feeds into your assertion of a 24x space savings.

The encoded form is more easily converted to a compact index. For example, you can treat the position as a little-endian number with the board size as its radix:

#define BOARD_SIZE 25
typedef unsigned long long index_t;   /* any type big enough for the full index range */

index_t to_index(element_t in[], int num_elements) {
    // The leading digit must not be zero
    index_t result = in[num_elements - 1] + 1;

    for (int i = num_elements - 1; i--; ) {
        result = result * BOARD_SIZE + in[i];
    }
    return result;
}

There are still gaps in that, to be sure, but I estimate them to constitute a reasonably small proportion of the overall range of index values used (and arranging for that to be so is the reason for taking a little-endian interpretation). I leave the reverse transformation as an exercise :).
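For readers who want to check their work, the reverse transformation might look like the following sketch (from_index is a hypothetical name; it assumes the to_index above):

void from_index(index_t index, element_t out[], int num_elements) {
    /* peel off base-BOARD_SIZE digits in little-endian order */
    for (int i = 0; i < num_elements - 1; i++) {
        out[i] = index % BOARD_SIZE;
        index /= BOARD_SIZE;
    }
    out[num_elements - 1] = index - 1;  /* undo the +1 on the leading digit */
}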

To convert from naive to compact position, you can iterate over the n-tuple and perform these steps for each position p:

  1. optionally check that position p is available
  2. set position p as busy
  3. subtract from p the number of lower positions that are busy
  4. store the result into the destination n-tuple

You can do this by maintaining an array of n bits for the busyness state:

  • steps 1, 2 and 4 take constant time
  • step 3 can be computed efficiently if the array is small, e.g. if it fits in 64 bits.

Here is an implementation:

#include <stdio.h>
#include <stdlib.h>

/* version for up to 9 positions */
#define BC9(n)  ((((n)>>0)&1) + (((n)>>1)&1) + (((n)>>2)&1) + \
                 (((n)>>3)&1) + (((n)>>4)&1) + (((n)>>5)&1) + \
                 (((n)>>6)&1) + (((n)>>7)&1) + (((n)>>8)&1))
#define x4(m,n)    m(n), m((n)+1), m((n)+2), m((n)+3)
#define x16(m,n)   x4(m,n), x4(m,(n)+4), x4(m,(n)+8), x4(m,(n)+12)
#define x64(m,n)   x16(m,n), x16(m,(n)+16), x16(m,(n)+32), x16(m,(n)+48)
#define x256(m,n)  x64(m,n), x64(m,(n)+64), x64(m,(n)+128), x64(m,(n)+192)

static int const bc512[1 << 9] = {
    x256(BC9, 0),
    x256(BC9, 256),
};

int encode9(int dest[], int src[], int n) {
    unsigned int busy = 0;
    for (int i = 0; i < n; i++) {
        int p = src[i];
        unsigned int bit = 1 << p;
        //if (busy & bit) return 1;  // optional validity check
        busy |= bit;
        dest[i] = p - bc512[busy & (bit - 1)];
    }
    return 0;
}

/* version for up to 64 positions */
static inline int bitcount64(unsigned long long m) {
    m = m - ((m >> 1) & 0x5555555555555555);
    m = (m & 0x3333333333333333) + ((m >> 2) & 0x3333333333333333);
    m = (m + (m >> 4)) & 0x0f0f0f0f0f0f0f0f;
    m = m + (m >> 8);
    m = m + (m >> 16);
    m = m + (m >> 16 >> 16);
    return m & 0x3f;
}

int encode64(int dest[], int src[], int n) {
    unsigned long long busy = 0;
    for (int i = 0; i < n; i++) {
        int p = src[i];
        unsigned long long bit = 1ULL << p;
        //if (busy & bit) return 1;  // optional validity check
        busy |= bit;
        dest[i] = p - bitcount64(busy & (bit - 1));
    }
    return 0;
}

int main(int argc, char *argv[]) {
    int src[argc], dest[argc];
    int cur, max = 0, n = argc - 1;

    for (int i = 0; i < n; i++) {
        src[i] = cur = atoi(argv[i + 1]);
        if (max < cur)
            max = cur;
    }
    if (max < 9) {
        encode9(dest, src, n);
    } else {
        encode64(dest, src, n);
    }
    for (int i = 0; i < n; i++) {
        printf("%d ", dest[i]);
    }
    printf("\n");
    return 0;
}

The core optimisation is in the implementation of bitcount(), which you can tailor to your needs by specializing it to the actual number of positions. I posted efficient solutions above for small numbers up to 9 and large numbers up to 64, but you can craft a more efficient solution for 12 or 32 positions, as sketched below.
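For example, a 32-position variant could use the classic 32-bit SWAR population count (a sketch; bitcount32 is a name of my choosing, by analogy with the bitcount64 above):

static inline int bitcount32(unsigned int m) {
    m = m - ((m >> 1) & 0x55555555);                /* 2-bit sums */
    m = (m & 0x33333333) + ((m >> 2) & 0x33333333); /* 4-bit sums */
    m = (m + (m >> 4)) & 0x0f0f0f0f;                /* 8-bit sums */
    return (int)((m * 0x01010101) >> 24);           /* add the four byte sums */
}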

In terms of time complexity, in the general case we still have O(n²), but for small values of n it effectively runs in O(n·log(n)) or better, since the parallel implementation of bitcount() reduces to log(n) steps or fewer for n up to 64.

You can look at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive for inspiration and amazement.

Unfortunately, I'm still looking for ways to use this or a similar trick for decoding...


In this answer, I want to show some of my own ideas for implementing the conversions as well as some benchmarking results.

You can find the code on GitHub. These are the results on my main machine:

algorithm   ------ total time -------  ---------- per call ------------
            decoding encoding total    decoding   encoding   total
baseline    0.0391s  0.0312s  0.0703s    3.9062ns   3.1250ns   7.0312ns
count       1.5312s  1.4453s  2.9766s  153.1250ns 144.5312ns 297.6562ns
bitcount    1.5078s  0.0703s  1.5781s  150.7812ns   7.0312ns 157.8125ns
decrement   2.1875s  1.7969s  3.9844s  218.7500ns 179.6875ns 398.4375ns
bin4        2.1562s  1.7734s  3.9297s  215.6250ns 177.3438ns 392.9688ns
bin5        2.0703s  1.8281s  3.8984s  207.0312ns 182.8125ns 389.8438ns
bin8        2.0547s  1.8672s  3.9219s  205.4688ns 186.7188ns 392.1875ns
vector      0.3594s  0.2891s  0.6484s   35.9375ns  28.9062ns  64.8438ns
shuffle     0.1328s  0.3438s  0.4766s   13.2812ns  34.3750ns  47.6562ns
tree        2.0781s  1.7734s  3.8516s  207.8125ns 177.3438ns 385.1562ns
treeasm     1.4297s  0.7422s  2.1719s  142.9688ns  74.2188ns 217.1875ns
bmi2        0.0938s  0.0703s  0.1641s    9.3750ns   7.0312ns  16.4062ns

Implementations

  • baseline is an implementation that does nothing except reading the input. Its purpose is to measure function-call and memory-access overhead.
  • count is a “naïve” implementation that stores an occupancy map indicating which squares already have pieces on them.
  • bitcount is the same thing but with the occupancy map stored as a bitmap. __builtin_popcount is used for encoding, speeding things up considerably. If one uses a hand-written popcount instead, bitcount is still the fastest portable implementation of encoding.
  • decrement is the second naïve implementation. It stores the encoding for each square of the board; after adding a piece, all square numbers to the right are decremented.
  • bin4, bin5, and bin8 use binning with bins sized 4, 5, and 8 entries as suggested by user3386109.
  • shuffle computes a slightly different encoding based on the Fisher–Yates shuffle. It works by reconstructing the random values that would have gone into a shuffle generating the permutation we want to encode (a sketch of this idea follows the list). The code is branchless and fast, in particular when decoding.
  • vector uses a vector of five-bit numbers as suggested by chqrlie.
  • tree uses a difference tree, a data structure I made up. It's a full binary tree of depth ⌈log₂ n⌉ where the leaves represent the squares and the inner nodes on the path to each leaf sum to the code of that square (only the nodes where you descend to the right are added). The square numbers themselves are not stored, so only n − 1 words of extra memory are needed.

    With this data structure, we can compute the code for each square in ⌈log₂ n⌉ − 1 steps and mark a square as occupied in the same number of steps. The inner loop is very simple, comprising a branch and two actions depending on whether you descend to the left or to the right. On ARM, this branch compiles to a few conditional instructions, leading to a very fast implementation. On x86, neither gcc nor clang is smart enough to get rid of the branches.

  • treeasm is a variant of tree that uses inline assembly to implement the inner loop of tree without branches by carefully manipulating the carry flag.
  • bmi2 uses the pdep and pext instructions from the BMI2 instruction set to implement the algorithm in a very fast manner.
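
Here is the sketch referenced in the shuffle bullet above: one way to recover the Fisher–Yates swap indices that would produce a given permutation is to simulate the shuffle forward while tracking the inverse permutation, so each lookup takes O(1). This is my reconstruction of the idea, not fuz's actual code (which lives in the linked GitHub repository); encode_shuffle is a hypothetical name and n is assumed to be at most 64.

void encode_shuffle(int code[], const int p[], int n) {
    int a[64], inv[64];            /* current array and its inverse; assumes n <= 64 */
    for (int i = 0; i < n; i++)
        a[i] = inv[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = inv[p[i]];         /* p[i] must be swapped into place from position j */
        code[i] = j;               /* the "random" value the shuffle would have drawn */
        a[j] = a[i];               /* perform the swap ...                            */
        inv[a[j]] = j;             /* ... and keep the inverse consistent             */
        a[i] = p[i];
        inv[p[i]] = i;
    }
    code[0] = 0;                   /* the last choice is always forced */
}

Decoding is then a plain Fisher–Yates pass: start from the identity array and swap a[i] with a[code[i]] for i from n − 1 down to 1.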

For my actual project, I'm probably going to use the shuffle implementation since it is the fastest one that does not depend on any non-portable extensions (such as Intel intrinsics) or implementation details (such as the availability of 128-bit integers).

To go from (5, 2, 3, 0, 7, 4) to (5, 2, 2, 0, 3, 1) you just have to:

  • start with (5, 2, 3, 0, 7, 4) and push 5 into the result: (5)
  • take 2 and count the preceding values less than 2, which is 0, then push 2-0: (5, 2)
  • take 3, count the preceding values less than 3, which is 1, then push 3-1: (5, 2, 2)
  • take 0, count the preceding values less than 0, which is 0, then push 0-0: (5, 2, 2, 0)
  • take 7, count..., 4, then push 7-4: (5, 2, 2, 0, 3)
  • take 4, count..., 3, then push 4-3: (5, 2, 2, 0, 3, 1)