I am trying to generate non-sequential human readable order codes derived from (lets say) a unsigned 32bit internal id that starts at 1 and is auto incremented for each new
In my example code below, will every
$hash
be unique?
Almost. (Which, I guess, means "no, but in a way that's easily fixed".) Your function consists of a sequence of independent steps; the overall function is bijective (reversible) if and only if every single one of those steps is. (Do you see why?)
Now, each step has one of the following forms:
$key = ($key ^ CONSTANT) ^ ($key >> NUM_BITS);
$key = ($key ^ CONSTANT) ^ ($key << NUM_BITS);
with NUM_BITS != 0
.
We can actually treat these as variants of a single form, by viewing the former as almost equivalent to this:
$key = invert_order_of_bits($key); # clearly bijective
$constant = invert_order_of_bits(CONSTANT);
$key = ($key ^ $constant) ^ ($key << NUM_BITS);
$key = invert_order_of_bits($key); # clearly bijective
So all we need is to show that this:
$key = ($key ^ CONSTANT) ^ ($key << NUM_BITS);
is bijective. Now, XOR is commutative and associative, so the above is equivalent to this:
$key = $key ^ ($key << NUM_BITS);
$key = $key ^ CONSTANT;
and (x ^ y) ^ y == x ^ (y ^ y) == x ^ 0 == x
, so clearly XOR-ing with a constant value is reversible (by re-XOR-ing with the same value); so all we have to show is that this is bijective:
$key = $key ^ ($key << NUM_BITS);
whenever NUM_BITS != 0
.
Now, I'm not writing a rigorous proof, so I'll just give a single reasoned-out example of how to reverse this. Suppose that $key ^ ($key << 9)
is
0010 1010 1101 1110 0010 0101 0000 1100
How do we obtain $key
? Well, we know that the last nine bits of $key << 9
are all zeroes, so we know that the last nine bits of $key ^ ($key << 9)
are the same as the last nine bits of $key
. So $key
looks like
bbbb bbbb bbbb bbbb bbbb bbb1 0000 1100
so $key << 9
looks like
bbbb bbbb bbbb bb10 0001 1000 0000 0000
so $key
looks like
bbbb bbbb bbbb bb00 0011 1101 0000 1100
(by XOR-ing $key ^ ($key << 9)
with $key << 9
), so $key << 9
looks like
bbbb b000 0111 1010 0001 1000 0000 0000
so $key
looks like
bbbb b010 1010 0100 0011 1101 0000 1100
so $key << 9
looks like
0101 1000 0111 1010 0001 1000 0000 0000
so $key
looks like
0111 0010 1010 0100 0011 1101 0000 1100
So . . . why do I say "almost" rather than "yes"? Why is your hash-function not perfectly bijective? It's because in PHP, the bitwise shift operators >>
and <<
are not quite symmetric, and while $key = $key ^ ($key << NUM_BITS)
is completely reversible, $key = $key ^ ($key >> NUM_BITS)
is not. (Above, when I wrote that the two types of steps were "almost equivalent", I really meant that "almost". It makes a difference!) You see, whereas <<
treats the sign bit just like any other bit, and shifts it out of existence (bringing in a zero-bit on the right), >>
treats the sign bit specially, and "extends" it: the bit that it brings in on the left is equal to the sign bit. (N.B. Your question mentions "unsigned 32bit" values, but PHP doesn't actually support that; its bitwise operations are always on signed integers.)
Due to this sign extension, if $key
starts with a 0
, then $key >> NUM_BITS
starts with a 0
, and if $key
starts with a 1
, then $key >> NUM_BITS
also starts with a 1
. In either case, $key ^ ($key >> NUM_BITS)
will start with a 0
. You've lost exactly one bit of entropy. If you give me $key ^ ($key >> 9)
, and don't tell me whether $key
is negative, then the best I can do is compute two possible values for $key
: one negative, one positive-or-zero.
You perform two steps that use right-shift instead of left-shift, so you lose two bits of entropy. (I'm hand-waving slightly — all I've actually demonstrated is that you lose at least one bit and at most two bits — but I'm confident that, due to the nature of the steps between those right-shift steps, you do actually lose two full bits.) For any given output value, there are exactly four distinct input-values that could yield it. So it's not unique, but it's almost unique; and it's easily fixed, by either: