I have an array of char (usually thousands of bytes long) read from a file, all composed of 0 and 1 (not '0' and '1', in which case I could use strtoul). I want to pack them into a bit array, one bit per input byte, as fast as possible.
If you don't need the output bits to appear in exactly the same order as the input bytes, but they can instead be "interleaved" in a specific way, then a fast and portable way to accomplish this is to take 8 blocks of 4 bytes (32 bytes total) and combine all of their LSBs into a single 4-byte value.

Something like:
#include <cstdint>
#include <cstring>

// Interleaved bit-pack: input byte 4*j + k (each 0 or 1) ends up in
// output bit 8*k + j. Block j is shifted left by j so its four LSBs
// (sitting at bit positions 0, 8, 16, 24 within t_j) never collide.
uint32_t extract_lsbs2(uint8_t (&input)[32]) {
    uint32_t t0, t1, t2, t3, t4, t5, t6, t7;
    memcpy(&t0, input + 0 * 4, 4);
    memcpy(&t1, input + 1 * 4, 4);
    memcpy(&t2, input + 2 * 4, 4);
    memcpy(&t3, input + 3 * 4, 4);
    memcpy(&t4, input + 4 * 4, 4);
    memcpy(&t5, input + 5 * 4, 4);
    memcpy(&t6, input + 6 * 4, 4);
    memcpy(&t7, input + 7 * 4, 4);
    return
        (t0 << 0) |
        (t1 << 1) |
        (t2 << 2) |
        (t3 << 3) |
        (t4 << 4) |
        (t5 << 5) |
        (t6 << 6) |
        (t7 << 7);
}
This generates "not terrible, not great" code on most compilers.
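For a buffer thousands of bytes long, as in the question, you would call this in an outer loop over 32-byte chunks. A sketch of what that might look like (pack_all is a hypothetical name, and it assumes the length is a multiple of 32, so a real version would need to handle the tail):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical driver over a long 0/1 buffer, calling extract_lsbs2
// (defined above) on each 32-byte chunk. Assumes len % 32 == 0.
void pack_all(const uint8_t* in, size_t len, uint32_t* out) {
    for (size_t i = 0; i < len / 32; ++i) {
        uint8_t block[32];
        memcpy(block, in + i * 32, sizeof block);  // local copy satisfies the array-reference parameter
        out[i] = extract_lsbs2(block);
    }
}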
If you use uint64_t instead of uint32_t (8 blocks of 8 bytes, 64 bytes total, combined into a single 8-byte value), it will generally be about twice as fast on a 64-bit platform, assuming you have more than 32 total bytes to transform.
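A minimal sketch of that 64-bit variant, written as a loop for brevity (my own rewrite of the same idea, not code from the original answer):

#include <cstdint>
#include <cstring>

// 64-bit variant: 8 blocks of 8 bytes; input byte 8*j + k ends up
// in output bit 8*k + j.
uint64_t extract_lsbs64(uint8_t (&input)[64]) {
    uint64_t out = 0;
    for (int j = 0; j < 8; ++j) {
        uint64_t t;
        memcpy(&t, input + j * 8, 8);
        out |= t << j;  // block j's LSBs land at bit positions 8*k + j
    }
    return out;
}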
With SIMD you could easily vectorize the entire operation in something like two instructions (for AVX2, but any x86 SIMD ISA will work): a compare followed by pmovmskb.
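A hedged sketch of that with AVX2 intrinsics (the function name is mine). Note that, unlike the scalar version, this one preserves input order, since movemask sends byte i's MSB to bit i:

#include <immintrin.h>
#include <cstdint>

// AVX2 sketch: pack 32 bytes (each 0 or 1) into one 32-bit mask.
// The signed compare yields 0xFF for every nonzero byte, and
// vpmovmskb (movemask) gathers the byte MSBs into 32 bits.
uint32_t extract_lsbs_avx2(const uint8_t* input) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(input));
    __m256i m = _mm256_cmpgt_epi8(v, _mm256_setzero_si256());
    return static_cast<uint32_t>(_mm256_movemask_epi8(m));
}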