This question: How to de-interleave bits (UnMortonizing?) has a good answer for extracting one of the two halves of a Morton number (just the odd bits), but I need a solutio
In case someone is using morton codes in 3d, so he needs to read one bit every 3, and 64 bits here is the function I used:
uint64_t morton3(uint64_t x) {
x = x & 0x9249249249249249;
x = (x | (x >> 2)) & 0x30c30c30c30c30c3;
x = (x | (x >> 4)) & 0xf00f00f00f00f00f;
x = (x | (x >> 8)) & 0x00ff0000ff0000ff;
x = (x | (x >> 16)) & 0xffff00000000ffff;
x = (x | (x >> 32)) & 0x00000000ffffffff;
return x;
}
uint64_t bits;
uint64_t x = morton3(bits)
uint64_t y = morton3(bits>>1)
uint64_t z = morton3(bits>>2)
If your processor handles 64 bit ints efficiently, you could combine the operations...
int64 w = (z &0xAAAAAAAA)<<31 | (z &0x55555555 )
w = (w | (w >> 1)) & 0x3333333333333333;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F;
...
I didn't want to be limited to a fixed size integer and making lists of similar commands with hardcoded constants, so I developed a C++11 solution which makes use of template metaprogramming to generate the functions and the constants. The assembly code generated with -O3
seems as tight as it can get without using BMI:
andl $0x55555555, %eax
movl %eax, %ecx
shrl %ecx
orl %eax, %ecx
andl $0x33333333, %ecx
movl %ecx, %eax
shrl $2, %eax
orl %ecx, %eax
andl $0xF0F0F0F, %eax
movl %eax, %ecx
shrl $4, %ecx
orl %eax, %ecx
movzbl %cl, %esi
shrl $8, %ecx
andl $0xFF00, %ecx
orl %ecx, %esi
TL;DR source repo and live demo.
Basically every step in the morton1
function works by shifting and adding to a sequence of constants which look like this:
0b0101010101010101
(alternate 1 and 0)0b0011001100110011
(alternate 2x 1 and 0)0b0000111100001111
(alternate 4x 1 and 0)0b0000000011111111
(alternate 8x 1 and 0)If we were to use D
dimensions, we would have a pattern with D-1
zeros and 1
one. So to generate these it's enough to generate consecutive ones and apply some bitwise or:
/// @brief Generates 0b1...1 with @tparam n ones
template <class T, unsigned n>
using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;
/// @brief Performs `@tparam input | (@tparam input << @tparam width` @tparam repeat times.
template <class T, T input, unsigned width, unsigned repeat>
struct lshift_add :
public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
};
/// @brief Specialization for 1 repetition, just does the shift-and-add operation.
template <class T, T input, unsigned width>
struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
(input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
};
Now that we can generate the constants at compile time for arbitrary dimensions with the following:
template <class T, unsigned step, unsigned dimensions = 2u>
using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;
With the same type of recursion, we can generate functions for each of the steps of the algorithm x = (x | (x >> K)) & M
:
template <class T, unsigned step, unsigned dimensions>
struct deinterleave {
static T work(T input) {
input = deinterleave<T, step - 1, dimensions>::work(input);
return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
}
};
// Omitted specialization for step 0, where there is just a bitwise and
It remains to answer the question "how many steps do we need?". This depends also on the number of dimensions. In general, k
steps compute 2^k - 1
output bits; the maximum number of meaningful bits for each dimension is given by z = sizeof(T) * 8 / dimensions
, therefore it is enough to take 1 + log_2 z
steps. The problem is now that we need this as constexpr
in order to use it as a template parameter. The best way I found to work around this is to define log2
via metaprogramming:
template <unsigned arg>
struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
template <>
struct log2<1u> : public std::integral_constant<unsigned, 0u> {};
/// @brief Helper constexpr which returns the number of steps needed to fully interleave a type @tparam T.
template <class T, unsigned dimensions>
using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;
And finally, we can perform one single call:
/// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
template <class T, unsigned dimensions>
T deinterleave_first(T n) {
return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
}
Code for the Intel Haswell and later CPUs. You can use the BMI2 instruction set which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.
#include <immintrin.h>
#include <stdint.h>
// on GCC, compile with option -mbmi2, requires Haswell or better.
uint64_t xy_to_morton (uint32_t x, uint32_t y)
{
return _pdep_u32(x, 0x55555555) | _pdep_u32(y,0xaaaaaaaa);
}
uint64_t morton_to_xy (uint64_t m, uint32_t *x, uint32_t *y)
{
*x = _pext_u64(m, 0x5555555555555555);
*y = _pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}
If you need speed than you can use table-lookup for one byte conversion at once (two bytes table is faster but to big). Procedure is made under Delphi IDE but the assembler/algorithem is the same.
const
MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $12, ... ;
procedure DeinterleaveBits(Input: cardinal);
//In: eax
//Out: dx = EvenBits; ax = OddBits;
asm
movzx ecx, al //Use 0th byte
mov dl, byte ptr[MortonTableLookup + ecx]
//
shr eax, 8
movzx ecx, ah //Use 2th byte
mov dh, byte ptr[MortonTableLookup + ecx]
//
shl edx, 16
movzx ecx, al //Use 1th byte
mov dl, byte ptr[MortonTableLookup + ecx]
//
shr eax, 8
movzx ecx, ah //Use 3th byte
mov dh, byte ptr[MortonTableLookup + ecx]
//
mov ecx, edx
and ecx, $F0F0F0F0
mov eax, ecx
rol eax, 12
or eax, ecx
rol edx, 4
and edx, $F0F0F0F0
mov ecx, edx
rol ecx, 12
or edx, ecx
end;