How to efficiently de-interleave bits (inverse Morton)

情歌与酒 2020-12-05 03:15

This question: How to de-interleave bits (UnMortonizing?) has a good answer for extracting one of the two halves of a Morton number (just the odd bits), but I need a solutio

5 answers
  • 2020-12-05 03:33

    In case someone is using Morton codes in 3D and so needs to read one bit out of every three from a 64-bit value, here is the function I used:

    #include <stdint.h>

    /* Gathers bits 0, 3, 6, ..., 63 of x into the low 22 bits of the result. */
    uint64_t morton3(uint64_t x) {
        x = x & 0x9249249249249249;
        x = (x | (x >> 2))  & 0x30c30c30c30c30c3;
        x = (x | (x >> 4))  & 0xf00f00f00f00f00f;
        x = (x | (x >> 8))  & 0x00ff0000ff0000ff;
        x = (x | (x >> 16)) & 0xffff00000000ffff;
        x = (x | (x >> 32)) & 0x00000000ffffffff;
        return x;
    }

    uint64_t bits;                     /* the 64-bit 3D Morton code */
    uint64_t x = morton3(bits);
    uint64_t y = morton3(bits >> 1);
    uint64_t z = morton3(bits >> 2);
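
For checking a shift-and-mask routine like morton3, a loop-based reference can help. This is only a minimal sketch; the name `every_third_bit` is made up for illustration and is not part of the answer above. It should agree with morton3 for every input, just much more slowly:

```cpp
#include <cstdint>

// Reference version: copies bits 0, 3, 6, ..., 63 of `code`
// into bits 0, 1, 2, ..., 21 of the result, one bit at a time.
inline std::uint64_t every_third_bit(std::uint64_t code) {
    std::uint64_t out = 0;
    for (unsigned i = 0; 3 * i < 64; ++i) {
        const std::uint64_t bit = (code >> (3 * i)) & 1u;
        out |= bit << i;
    }
    return out;
}
```

For example, the code 87 (binary 1010111) decodes to x = 5, y = 3, z = 1 via `every_third_bit(87)`, `every_third_bit(87 >> 1)` and `every_third_bit(87 >> 2)`.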
    
  • 2020-12-05 03:41

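    If your processor handles 64-bit integers efficiently, you could combine the operations:

    // z is the 32-bit Morton code; odd bits go to the upper half, even bits stay in the lower half
    uint64_t w = ((uint64_t)(z & 0xAAAAAAAA) << 31) | (z & 0x55555555);
    w = (w | (w >> 1)) & 0x3333333333333333;
    w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F;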
    ...
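
Continuing the pattern through the elided steps, the whole idea might look like the sketch below. It assumes z is a 32-bit Morton code of two 16-bit coordinates. The names `XY16` and `demorton2_32` are hypothetical, and the last two masks are a continuation of the pattern rather than something given in the answer:

```cpp
#include <cstdint>

struct XY16 {
    std::uint16_t x;  // taken from the even bits of z
    std::uint16_t y;  // taken from the odd bits of z
};

// Decodes both coordinates in a single 64-bit pass.
inline XY16 demorton2_32(std::uint32_t z) {
    // Odd bits move to even positions of the upper half; even bits stay put.
    std::uint64_t w = (static_cast<std::uint64_t>(z & 0xAAAAAAAAu) << 31)
                    | (z & 0x55555555u);
    // Each step compacts both 32-bit halves at once.
    w = (w | (w >> 1)) & 0x3333333333333333ULL;
    w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0FULL;
    w = (w | (w >> 4)) & 0x00FF00FF00FF00FFULL;
    w = (w | (w >> 8)) & 0x0000FFFF0000FFFFULL;
    XY16 r;
    r.x = static_cast<std::uint16_t>(w);
    r.y = static_cast<std::uint16_t>(w >> 32);
    return r;
}
```

The masks keep bits that cross from the upper half into the lower half from leaking into the result, so the two halves never interfere.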
    
  • 2020-12-05 03:46

    I didn't want to be limited to a fixed-size integer or to maintain lists of near-identical statements with hardcoded constants, so I developed a C++11 solution that uses template metaprogramming to generate both the functions and the constants. The assembly code generated with -O3 seems as tight as it can get without using BMI:

    andl    $0x55555555, %eax
    movl    %eax, %ecx
    shrl    %ecx
    orl     %eax, %ecx
    andl    $0x33333333, %ecx
    movl    %ecx, %eax
    shrl    $2, %eax
    orl     %ecx, %eax
    andl    $0xF0F0F0F, %eax
    movl    %eax, %ecx
    shrl    $4, %ecx
    orl     %eax, %ecx
    movzbl  %cl, %esi
    shrl    $8, %ecx
    andl    $0xFF00, %ecx
    orl     %ecx, %esi
    

    TL;DR source repo and live demo.


    Implementation

    Basically, every step in the morton1 function works by shifting and then masking with one of a sequence of constants, which look like this:

    1. 0b0101010101010101 (alternate 1 and 0)
    2. 0b0011001100110011 (alternate 2x 1 and 0)
    3. 0b0000111100001111 (alternate 4x 1 and 0)
    4. 0b0000000011111111 (alternate 8x 1 and 0)

    With D dimensions, the first pattern would instead have a single 1 followed by D-1 zeros, and each later step doubles both run lengths. So to generate these masks it's enough to generate a run of consecutive ones and replicate it with shifts and bitwise OR:

    /// @brief Generates 0b1...1 with @tparam n ones
    template <class T, unsigned n>
    using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;
    
    /// @brief Performs `input | (input << width)`, @tparam repeat times.
    template <class T, T input, unsigned width, unsigned repeat>
    struct lshift_add :
        public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
    };
    /// @brief Specialization for 1 repetition, just does the shift-and-add operation.
    template <class T, T input, unsigned width>
    struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
        (input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
    };
    

    Now we can generate the constants at compile time for an arbitrary number of dimensions with the following:

    template <class T, unsigned step, unsigned dimensions = 2u>
    using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;
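
As a cross-check on what such masks should look like, here is a small sketch that builds them with C++11 constexpr functions instead of class templates. It is fixed to 64 bits, and the names `low_ones`, `repeat_from` and `group_mask` are made up for illustration; this is not the code from the repo:

```cpp
#include <cstdint>

// `run` consecutive ones in the low bits (all ones if run >= 64).
constexpr std::uint64_t low_ones(unsigned run) {
    return run >= 64u ? ~std::uint64_t(0) : (std::uint64_t(1) << run) - 1u;
}

// ORs copies of `ones` at bit positions pos, pos + period, ... below 64.
constexpr std::uint64_t repeat_from(std::uint64_t ones, unsigned period, unsigned pos) {
    return pos >= 64u ? std::uint64_t(0)
                      : (ones << pos) | repeat_from(ones, period, pos + period);
}

// Mask for a given step: groups of 2^step ones, spaced dims * 2^step apart.
// Assumes dims >= 1.
constexpr std::uint64_t group_mask(unsigned step, unsigned dims) {
    return repeat_from(low_ones(1u << step), dims * (1u << step), 0u);
}

static_assert(group_mask(0, 2) == 0x5555555555555555ULL, "step 0, 2D");
static_assert(group_mask(1, 2) == 0x3333333333333333ULL, "step 1, 2D");
static_assert(group_mask(2, 2) == 0x0F0F0F0F0F0F0F0FULL, "step 2, 2D");
static_assert(group_mask(0, 3) == 0x9249249249249249ULL, "step 0, 3D");
```

The low 16 bits of the 2D masks match the four constants listed earlier in this answer.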
    

    With the same type of recursion, we can generate functions for each of the steps of the algorithm x = (x | (x >> K)) & M:

    template <class T, unsigned step, unsigned dimensions>
    struct deinterleave {
        static T work(T input) {
            input = deinterleave<T, step - 1, dimensions>::work(input);
            return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
        }
    };
    // Omitted specialization for step 0, where there is just a bitwise and
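
To make the recursion easier to follow, the same sequence of steps can be written as a plain runtime loop. This is a hedged sketch fixed to 64-bit values, with hypothetical names `mask_for` and `deinterleave_loop`; it illustrates the algorithm and is not the author's template code:

```cpp
#include <cstdint>

// Groups of `run` ones, spaced dims * run bits apart (dims >= 1 assumed).
inline std::uint64_t mask_for(unsigned run, unsigned dims) {
    const std::uint64_t ones = run >= 64u ? ~0ULL : (1ULL << run) - 1u;
    std::uint64_t m = 0;
    for (unsigned pos = 0; pos < 64u; pos += dims * run) {
        m |= ones << pos;
    }
    return m;
}

// Extracts the first coordinate of a Morton code with `dims` dimensions.
inline std::uint64_t deinterleave_loop(std::uint64_t x, unsigned dims) {
    const unsigned bits_per_dim = (64u + dims - 1u) / dims;  // ceil(64 / dims)
    x &= mask_for(1u, dims);  // step 0: just a bitwise AND
    for (unsigned run = 1u; run < bits_per_dim; run *= 2u) {
        // Merge neighbouring groups of `run` bits into groups of 2 * run bits.
        x = (x | (x >> ((dims - 1u) * run))) & mask_for(2u * run, dims);
    }
    return x;
}
```

The other coordinates come from shifting the input first, for example `deinterleave_loop(code >> 1, dims)` for the second one.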
    

    It remains to answer the question "how many steps do we need?". This also depends on the number of dimensions. In general, k steps compute 2^(k-1) output bits; the maximum number of meaningful bits for each dimension is z = sizeof(T) * 8 / dimensions, so it is enough to take 1 + log_2 z steps. The problem is that we need this value as a constant expression in order to use it as a template parameter. The best way I found to work around this is to define log2 via metaprogramming:

    template <unsigned arg>
    struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
    template <>
    struct log2<1u> : public std::integral_constant<unsigned, 0u> {};
    
    /// @brief Helper constant holding the number of steps needed to fully deinterleave a type @tparam T.
    template <class T, unsigned dimensions>
    using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;
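
As a possible alternative to the struct-based log2, C++11 also allows a recursive constexpr function as long as its body is a single return statement. A minimal sketch, where `floor_log2` is a hypothetical name and not something from the repo:

```cpp
// Integer base-2 logarithm, rounded down; defined as 0 for n <= 1.
constexpr unsigned floor_log2(unsigned n) {
    return n <= 1u ? 0u : 1u + floor_log2(n >> 1);
}

static_assert(floor_log2(1u) == 0u, "log2(1) is 0");
static_assert(floor_log2(16u) == 4u, "32-bit type, 2 dimensions");
static_assert(floor_log2(21u) == 4u, "64-bit type, 3 dimensions (rounds down)");
```

Its result can be used directly as a template argument. Note that, like the struct version, it rounds down when the number of bits per dimension is not a power of two.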
    

    And finally, we can perform one single call:

    /// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
    template <class T, unsigned dimensions>
    T deinterleave_first(T n) {
        return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
    }
    
  • 2020-12-05 03:47

    For Intel Haswell and later CPUs, you can use the BMI2 instruction set, which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.

    #include <immintrin.h>
    #include <stdint.h>
    
    // On GCC, compile with option -mbmi2; requires Haswell or later.
    // The 64-bit intrinsics are only available when targeting x86-64.
    
    uint64_t xy_to_morton(uint32_t x, uint32_t y)
    {
        return _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);
    }
    
    void morton_to_xy(uint64_t m, uint32_t *x, uint32_t *y)
    {
        *x = (uint32_t)_pext_u64(m, 0x5555555555555555);
        *y = (uint32_t)_pext_u64(m, 0xaaaaaaaaaaaaaaaa);
    }
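
To see what pext and pdep compute without BMI2 hardware, here is a slow, portable sketch of their semantics. The names `pext_reference` and `pdep_reference` are made up for illustration; they are not intrinsics and are only meant for testing or as a fallback:

```cpp
#include <cstdint>

// Gathers the bits of `src` selected by `mask` into the low bits of the result.
inline std::uint64_t pext_reference(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t out = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1u);  // lowest set bit of mask
        if (src & lowest) out |= bit;
        mask &= mask - 1u;                                 // clear that bit
    }
    return out;
}

// Scatters the low bits of `src` to the positions selected by `mask`.
inline std::uint64_t pdep_reference(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t out = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1u);
        if (src & bit) out |= lowest;
        mask &= mask - 1u;
    }
    return out;
}
```

With x = 3 and y = 5, depositing x into the 0x5555... mask and y into the 0xaaaa... mask gives the Morton code 39, and extracting with the same masks returns 3 and 5 again.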
    
  • 2020-12-05 03:53

    If you need speed, you can use a table lookup that converts one byte at a time (a two-byte table is faster but too big). The procedure was written in the Delphi IDE, but the assembler and the algorithm are the same elsewhere.

    const
      MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $02, ...);
    
    procedure DeinterleaveBits(Input: cardinal);
    //In: eax
    //Out: dx = EvenBits; ax = OddBits;
    asm
      movzx   ecx, al                                     //Use 0th byte
      mov     dl, byte ptr[MortonTableLookup + ecx]
    //
      shr     eax, 8
      movzx   ecx, ah                                     //Use 2nd byte
      mov     dh, byte ptr[MortonTableLookup + ecx]
    //
      shl     edx, 16
      movzx   ecx, al                                     //Use 1st byte
      mov     dl, byte ptr[MortonTableLookup + ecx]
    //
      shr     eax, 8
      movzx   ecx, ah                                     //Use 3rd byte
      mov     dh, byte ptr[MortonTableLookup + ecx]
    //
      mov     ecx, edx  
      and     ecx, $F0F0F0F0
      mov     eax, ecx
      rol     eax, 12
      or      eax, ecx
    
      rol     edx, 4
      and     edx, $F0F0F0F0
      mov     ecx, edx
      rol     ecx, 12
      or      edx, ecx
    end;
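
For readers who do not use Delphi, the same byte-at-a-time idea might be sketched in C++ as follows. The table layout (even bits in the low nibble, odd bits in the high nibble) is inferred from the assembly above, and the names `Halves`, `make_table` and `deinterleave_lut` are hypothetical:

```cpp
#include <array>
#include <cstdint>

struct Halves {
    std::uint16_t even;  // bits 0, 2, 4, ... of the input
    std::uint16_t odd;   // bits 1, 3, 5, ... of the input
};

// Builds the 256-entry table: low nibble = even bits, high nibble = odd bits.
inline std::array<std::uint8_t, 256> make_table() {
    std::array<std::uint8_t, 256> t{};
    for (unsigned b = 0; b < 256; ++b) {
        unsigned even = 0, odd = 0;
        for (unsigned i = 0; i < 4; ++i) {
            even |= ((b >> (2 * i)) & 1u) << i;
            odd  |= ((b >> (2 * i + 1)) & 1u) << i;
        }
        t[b] = static_cast<std::uint8_t>((odd << 4) | even);
    }
    return t;
}

// Splits a 32-bit value into its even and odd bits, one byte per lookup.
inline Halves deinterleave_lut(std::uint32_t in) {
    static const std::array<std::uint8_t, 256> table = make_table();
    unsigned e = 0, o = 0;
    for (unsigned k = 0; k < 4; ++k) {
        const unsigned v = table[(in >> (8 * k)) & 0xFFu];
        e |= (v & 0x0Fu) << (4 * k);
        o |= (v >> 4) << (4 * k);
    }
    Halves h;
    h.even = static_cast<std::uint16_t>(e);
    h.odd = static_cast<std::uint16_t>(o);
    return h;
}
```

The table is built once on first use; a hardcoded constant table, as in the Delphi version, avoids even that cost.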
    