Question
I have an array of bytes (unsigned char *) that must be converted to integers. Each integer is represented over three bytes. This is what I have done:
//bytes array is allocated and filled
//allocating space for intBuffer (uint32_t)
unsigned long i = 0;
for(; i < size_tot; i += 3){
    uint32_t number = (bytes[i] << 16) | (bytes[i+1] << 8) | bytes[i+2];
    intBuffer[number]++;
}
This piece of code does its job well, but it is incredibly slow due to the three memory accesses per iteration (especially for big values of size_tot, on the order of 3000000). Is there a way to do it faster and increase performance?
Answer 1:
The correct answer is almost always:
Write correct code, enable optimisations, trust your compiler.
Given:
void count_values(std::array<uint32_t, 256 * 256 * 256>& results, // the original wrote 256^3, which is XOR in C++ (== 259) - hence the Lm259E in the mangled names below
                  const unsigned char* from,
                  const unsigned char* to)
{
    for (; from != to; from = std::next(from, 3)) {
        ++results[(*from << 16) | (*std::next(from, 1) << 8) | *std::next(from, 2)];
    }
}
Compiled with -O3, this yields (explanatory comments inlined):
__Z12count_valuesRNSt3__15arrayIjLm259EEEPKhS4_: ## @_Z12count_valuesRNSt3__15arrayIjLm259EEEPKhS4_
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
jmp LBB0_2
.align 4, 0x90
LBB0_1: ## %.lr.ph
## in Loop: Header=BB0_2 Depth=1
# dereference from and extend the 8-bit value to 32 bits
movzbl (%rsi), %eax
shlq $16, %rax # shift left 16
movzbl 1(%rsi), %ecx # dereference *(from+1) and extend to 32bits by padding with zeros
shlq $8, %rcx # shift left 8
orq %rax, %rcx # or into above result
movzbl 2(%rsi), %eax # dereference *(from+2) and extend to 32bits
orq %rcx, %rax # or into above result
incl (%rdi,%rax,4) # increment the correct counter
addq $3, %rsi # from += 3
LBB0_2: ## %.lr.ph
## =>This Inner Loop Header: Depth=1
cmpq %rdx, %rsi # while from != to
jne LBB0_1
## BB#3: ## %._crit_edge
popq %rbp
retq
.cfi_endproc
Notice that there is no need to stray away from standard constructs or standard calls. The compiler produces perfect code.
To further prove the point, let's go crazy and write a custom iterator that allows us to reduce the function to this:
void count_values(std::array<uint32_t, 256 * 256 * 256>& results,
                  byte_triple_iterator from,
                  byte_triple_iterator to)
{
    assert(iterators_correct(from, to));
    while (from != to) {
        ++results[*from++];
    }
}
And here is a (basic) implementation of such an iterator:
struct byte_triple_iterator
{
    constexpr byte_triple_iterator(const std::uint8_t* p)
        : _ptr(p)
    {}

    std::uint32_t operator*() const noexcept {
        return (*_ptr << 16) | (*std::next(_ptr, 1) << 8) | *std::next(_ptr, 2);
    }

    byte_triple_iterator& operator++() noexcept {
        _ptr = std::next(_ptr, 3);
        return *this;
    }

    byte_triple_iterator operator++(int) noexcept {
        auto copy = *this;
        _ptr = std::next(_ptr, 3);
        return copy;
    }

    constexpr const std::uint8_t* byte_ptr() const {
        return _ptr;
    }

private:
    friend bool operator<(const byte_triple_iterator& from, const byte_triple_iterator& to)
    {
        return from._ptr < to._ptr;
    }

    friend bool operator==(const byte_triple_iterator& from, const byte_triple_iterator& to)
    {
        return from._ptr == to._ptr;
    }

    friend bool operator!=(const byte_triple_iterator& from, const byte_triple_iterator& to)
    {
        return not (from == to);
    }

    friend std::ptrdiff_t byte_difference(const byte_triple_iterator& from, const byte_triple_iterator& to)
    {
        return to._ptr - from._ptr;
    }

    const std::uint8_t* _ptr;
};

bool iterators_correct(const byte_triple_iterator& from,
                       const byte_triple_iterator& to)
{
    if (not (from < to))
        return false;
    auto dist = to.byte_ptr() - from.byte_ptr();
    return dist % 3 == 0;
}
Now what do we have?
- an assert to check that our source is indeed exactly the correct length (in debug builds)
- an output structure that is guaranteed to be the right size
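As a quick sanity check, here is a trimmed-down copy of the iterator (just the members the example needs, under a hypothetical name) decoding a six-byte buffer; the sample byte values are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Trimmed copy of byte_triple_iterator: each dereference packs
// three consecutive bytes into one big-endian 24-bit value.
struct triple_it {
    explicit triple_it(const std::uint8_t* p) : _ptr(p) {}
    std::uint32_t operator*() const noexcept {
        return (_ptr[0] << 16) | (_ptr[1] << 8) | _ptr[2];
    }
    triple_it& operator++() noexcept { _ptr += 3; return *this; }
    bool operator!=(const triple_it& o) const noexcept { return _ptr != o._ptr; }
private:
    const std::uint8_t* _ptr;
};
```

Dropping such an iterator into the counting loop gives exactly the ++results[*from++] form shown above.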
But what has it done to our object code? (compile with -O3 -DNDEBUG)
.globl __Z12count_valuesRNSt3__15arrayIjLm259EEE20byte_triple_iteratorS3_
.align 4, 0x90
__Z12count_valuesRNSt3__15arrayIjLm259EEE20byte_triple_iteratorS3_: ## @_Z12count_valuesRNSt3__15arrayIjLm259EEE20byte_triple_iteratorS3_
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
jmp LBB1_2
.align 4, 0x90
LBB1_1: ## %.lr.ph
## in Loop: Header=BB1_2 Depth=1
movzbl (%rsi), %eax
shlq $16, %rax
movzbl 1(%rsi), %ecx
shlq $8, %rcx
orq %rax, %rcx
movzbl 2(%rsi), %eax
orq %rcx, %rax
incl (%rdi,%rax,4)
addq $3, %rsi
LBB1_2: ## %.lr.ph
## =>This Inner Loop Header: Depth=1
cmpq %rdx, %rsi
jne LBB1_1
## BB#3: ## %._crit_edge
popq %rbp
retq
.cfi_endproc
Answer: nothing - it's just as efficient.
The lesson? No, really: trust your compiler!
Answer 2:
Assuming you want to count all the distinct values (your code: intBuffer[number]++;, with intBuffer having 2^24 entries), you could try some loop unrolling:
Instead of:
for(; i<size_tot; i+=3){
uint32_t number = (bytes[i]<<16) | (bytes[i+1]<<8) | bytes[i+2];
intBuffer[number]++;
}
do:
for(; i + 12 <= size_tot; i += 12){ // bound adjusted so all 12 bytes are in range
intBuffer[(bytes[i]<<16) | (bytes[i+1]<<8) | bytes[i+2]]++;
intBuffer[(bytes[i+3]<<16) | (bytes[i+4]<<8) | bytes[i+5]]++;
intBuffer[(bytes[i+6]<<16) | (bytes[i+7]<<8) | bytes[i+8]]++;
intBuffer[(bytes[i+9]<<16) | (bytes[i+10]<<8) | bytes[i+11]]++;
}
// Add a small loop for the remaining bytes (no multiple of 12)
This allows the CPU to execute multiple independent instructions per clock cycle (make sure compiler optimization is set to the highest level). You also need a small tail loop for the last part of bytes, when size_tot is not a multiple of 12.
Check out Instruction Pipelining:
Instruction pipelining is a technique that implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows faster CPU throughput (the number of instructions that can be executed in a unit of time) than would otherwise be possible at a given clock rate. The basic instruction cycle is broken up into a series called a pipeline. Rather than processing each instruction sequentially (finishing one instruction before starting the next), each instruction is split up into a sequence of steps so different steps can be executed in parallel and instructions can be processed concurrently (starting one instruction before finishing the previous one).
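Putting the pieces above together, a complete version of the unrolled loop with the leftover bytes handled might look like this (a sketch; the decode3 helper and the count_unrolled name are mine, not from the original code):

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// Pack three big-endian bytes into one 24-bit value, as in the question.
static uint32_t decode3(const unsigned char *b) {
    return ((uint32_t)b[0] << 16) | ((uint32_t)b[1] << 8) | b[2];
}

// Count triples with a 4x unrolled main loop plus a scalar tail.
static void count_unrolled(const unsigned char *bytes, size_t size_tot,
                           uint32_t *intBuffer) {
    size_t i = 0;
    for (; i + 12 <= size_tot; i += 12) {   // the "extra check": 12 bytes must fit
        intBuffer[decode3(bytes + i)]++;
        intBuffer[decode3(bytes + i + 3)]++;
        intBuffer[decode3(bytes + i + 6)]++;
        intBuffer[decode3(bytes + i + 9)]++;
    }
    for (; i + 3 <= size_tot; i += 3)       // remaining 1-3 triples
        intBuffer[decode3(bytes + i)]++;
}
```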
Update:
but it is incredibly slow
Actually, for 3 MB this should be near-instantaneous, even with your original code (assuming the data is already cached). How is bytes defined? Could it be that operator[] is doing some extra bounds checking?
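For reference, the usual suspect here is a checked accessor: std::vector::at is range-checked and throws, while operator[] normally is not (though debug or checked-iterator builds may add a check, which costs time in a hot loop). A minimal illustration (the at_throws helper is mine):

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true when v.at(i) throws, i.e. when i is out of range.
// By contrast, v[i] performs no range check in a typical release build.
bool at_throws(const std::vector<int>& v, std::size_t i) {
    try {
        (void)v.at(i);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```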
Answer 3:
First of all make sure compiler optimization is turned to the highest level.
I think I would give this a try:
unsigned char* pBytes = bytes;
uint32_t number;
for(unsigned long i = 0; i < size_tot; i += 3){
    number = *pBytes << 16;
    ++pBytes;
    number |= *pBytes << 8;
    ++pBytes;
    number |= *pBytes;
    ++pBytes;
    ++intBuffer[number];
}
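The loop above can be wrapped in a function (the wrapper and its name are mine) so that its output can be checked against the question's indexed version:

```cpp
#include <stddef.h>
#include <stdint.h>

// Pointer-walking variant of the counting loop: instead of computing
// bytes[i], bytes[i+1], bytes[i+2] each iteration, a single pointer is
// advanced through the buffer.
static void count_ptr(const unsigned char* bytes, size_t size_tot,
                      uint32_t* intBuffer) {
    const unsigned char* pBytes = bytes;
    uint32_t number;
    for (size_t i = 0; i < size_tot; i += 3) {
        number = *pBytes << 16;
        ++pBytes;
        number |= *pBytes << 8;
        ++pBytes;
        number |= *pBytes;
        ++pBytes;
        ++intBuffer[number];
    }
}
```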
After compiling, I would check the generated assembly to see whether the change actually made a difference.
Answer 4:
Try to read a word at a time and then extract the desired values. That should be more efficient than reading byte-by-byte.
Here's a sample implementation for 64-bit little-endian systems that reads three 64-bit words at a time:
void count(uint8_t* bytes, int* intBuffer, uint32_t size_tot)
{
    assert(size_tot > 7);
    uint64_t num1, num2, num3;
    uint8_t *bp = bytes;
    while ((uintptr_t)bp % 8)   // make sure that the pointer is properly aligned
    {
        num1 = (bp[2] << 16) | (bp[1] << 8) | bp[0];
        intBuffer[num1]++;
        bp += 3;
    }
    uint64_t* ip = (uint64_t*)bp;
    while ((uint8_t*)(ip + 2) < bytes + size_tot)
    {
        num1 = *ip++;
        num2 = *ip++;
        num3 = *ip++;
        intBuffer[num1 & 0xFFFFFF]++;
        intBuffer[(num1 >> 24) & 0xFFFFFF]++;
        intBuffer[(num1 >> 48) | ((num2 & 0xFF) << 16)]++;
        intBuffer[(num2 >> 8) & 0xFFFFFF]++;
        intBuffer[(num2 >> 32) & 0xFFFFFF]++;
        intBuffer[(num2 >> 56) | ((num3 & 0xFFFF) << 8)]++;
        intBuffer[(num3 >> 16) & 0xFFFFFF]++;
        intBuffer[num3 >> 40]++;
    }
    bp = (uint8_t*)ip;
    while (bp < bytes + size_tot)   // leftover bytes at the tail
    {
        num1 = (bp[2] << 16) | (bp[1] << 8) | bp[0];
        intBuffer[num1]++;
        bp += 3;
    }
}
You can check the compiler output on Compiler Explorer. Of course, smart compilers may already know how to do that, but most don't. As you can see from the Godbolt link, compilers will use a bunch of movzx instructions to read the separate bytes instead of reading a whole register. ICC does a lot more loop unrolling, but Clang and GCC don't.
Similarly, for 32-bit architectures you'd read three 32-bit words per iteration. You may also need to do some manual loop unrolling instead of relying on the compiler. Here's an example on 32-bit little-endian machines; it can easily be adapted for big endian like this:
intBuffer[num1 >> 8]++;
intBuffer[((num1 & 0xFF) << 16) | (num2 >> 16)]++;
intBuffer[((num2 & 0xFFFF) << 8) | (num3 >> 24)]++;
intBuffer[num3 & 0xFFFFFF]++;
But for more performance you may want to resort to a SIMD solution like SSE or AVX.
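As a sanity check on the mask arithmetic in the 64-bit loop above, the sketch below defines the extraction of the first three 24-bit fields plus a byte-wise reference, so the two can be compared (the helper names are mine):

```cpp
#include <stdint.h>

// Byte-wise reference: the little-endian 3-byte value starting at b,
// matching the scalar loops in the answer (bp[2]<<16 | bp[1]<<8 | bp[0]).
static uint32_t decode3_le(const uint8_t *b) {
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) | ((uint32_t)b[2] << 16);
}

// Extract the first three 24-bit fields from two little-endian 64-bit
// words, using exactly the masks and shifts from the answer above.
static void first_three_fields(uint64_t num1, uint64_t num2, uint32_t out[3]) {
    out[0] = num1 & 0xFFFFFF;
    out[1] = (num1 >> 24) & 0xFFFFFF;
    out[2] = (uint32_t)((num1 >> 48) | ((num2 & 0xFF) << 16));
}
```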
Source: https://stackoverflow.com/questions/34566603/fastest-way-to-convert-bytes-to-unsigned-int