Question
I have a large array (~100 000 elements) of floating-point values and a floating-point threshold.
I have to compare each value in the array against the threshold, but transferring the NEON flags takes a really long time (~20 cycles according to a profiler).
Is there an efficient way to compare these values?
NOTE: As rounding error doesn't matter, I tried the following:
float arr[10000];
float threshold;
....
int a = arr[20]; // e.g.
int t = threshold;
if (t > a) {....}
But in this case I get the following instruction sequence:
vldr.32 s0, [r0]
vcvt.s32.f32 s0, s0
vmov r0, s0             @ <--- takes 20 cycles, as does `vmrs APSR_nzcv, fpscr` in the case of a floating-point comparison
cmp r0, r1
Since the conversion happens in the NEON unit, it makes no difference whether I compare integers as described above or compare floats directly.
Answer 1:
If floats are 32-bit IEEE 754, ints are also 32-bit, and there are no +infinity, -infinity, or NaN values, we can compare floats as ints with a little trick:
#include <stdio.h>
#include <limits.h>
#include <assert.h>
#define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]
C_ASSERT(sizeof(int) == sizeof(float));
C_ASSERT(sizeof(int) * CHAR_BIT == 32);
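int isGreater(float* f1, float* f2)
{
    int i1, i2, t1, t2;
    // Reinterpret the float bits as integers (no value conversion)
    i1 = *(int*)f1;
    i2 = *(int*)f2;
    // t1 is all ones for a negative float, zero otherwise
    t1 = i1 >> 31;
    // Negative floats are sign-magnitude: map them to -magnitude
    // so that plain integer ordering matches float ordering
    i1 = (i1 ^ t1) + (t1 & 0x80000001);
    t2 = i2 >> 31;
    i2 = (i2 ^ t2) + (t2 & 0x80000001);
    return i1 > i2;
}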
int main(void)
{
    float arr[9] = { -3, -2, -1.5, -1, 0, 1, 1.5, 2, 3 };
    float thr;
    int i;
    // Make sure floats are 32-bit IEEE 754 and
    // reinterpreted as integers as we want/expect
    {
        static const float testf = 8873283.0f;
        unsigned testi = *(unsigned*)&testf;
        assert(testi == 0x4B076543);
    }
    thr = -1.5;
    for (i = 0; i < 9; i++)
    {
        printf("%f %s %f\n", arr[i], "<=\0> " + 3*isGreater(&arr[i], &thr), thr);
    }
    thr = 1.5;
    for (i = 0; i < 9; i++)
    {
        printf("%f %s %f\n", arr[i], "<=\0> " + 3*isGreater(&arr[i], &thr), thr);
    }
    return 0;
}
Output:
-3.000000 <= -1.500000
-2.000000 <= -1.500000
-1.500000 <= -1.500000
-1.000000 > -1.500000
0.000000 > -1.500000
1.000000 > -1.500000
1.500000 > -1.500000
2.000000 > -1.500000
3.000000 > -1.500000
-3.000000 <= 1.500000
-2.000000 <= 1.500000
-1.500000 <= 1.500000
-1.000000 <= 1.500000
0.000000 <= 1.500000
1.000000 <= 1.500000
1.500000 <= 1.500000
2.000000 > 1.500000
3.000000 > 1.500000
Of course, if your threshold doesn't change, it makes sense to precompute its final integer value (the one isGreater() uses in the comparison) once, outside the loop.
If you are afraid of undefined behavior in C/C++ in the above code, you can rewrite the code in assembly.
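As a rough sketch of that precomputation, the threshold can be mapped to its integer key once, before the loop. The helper names `float_to_ordered_key` and `count_greater` are made up for illustration, and `memcpy` is swapped in for the pointer cast to sidestep the aliasing concern. Like the code above, it assumes 32-bit IEEE 754 floats and no NaN values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compile-time check, same trick as C_ASSERT in the answer above. */
typedef char float_must_be_32_bits[(sizeof(float) == sizeof(uint32_t)) ? 1 : -1];

/* Map a float to an int32_t whose integer ordering matches the float
 * ordering (hypothetical helper; NaN inputs are not handled). */
static int32_t float_to_ordered_key(float f)
{
    uint32_t bits;
    int32_t magnitude;
    memcpy(&bits, &f, sizeof bits);             /* reinterpret, no conversion */
    magnitude = (int32_t)(bits & 0x7fffffffu);  /* always fits in int32_t */
    return (bits & 0x80000000u) ? -magnitude : magnitude;
}

/* Count the elements strictly greater than threshold. */
static size_t count_greater(const float *src, size_t count, float threshold)
{
    /* The threshold key is computed once, outside the loop. */
    const int32_t threshold_key = float_to_ordered_key(threshold);
    size_t hits = 0;
    size_t i;
    assert(src != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (float_to_ordered_key(src[i]) > threshold_key)
            hits++;
    }
    return hits;
}
```

With the nine-element test array from the program above, `count_greater(arr, 9, -1.5f)` gives 6 and `count_greater(arr, 9, 1.5f)` gives 2, which matches the number of ">" lines in the listed output.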
Answer 2:
If your data is float then you should do your comparisons with floats, e.g.
float arr[10000];
float threshold;
....
float a = arr[20]; // e.g.
if (threshold > a) {....}
otherwise you will have expensive float-int conversions.
Answer 3:
Your example shows how bad compiler-generated code can be:
It loads a value with NEON just to convert it to int, then does a NEON->ARM transfer that causes a pipeline flush, wasting 11~14 cycles.
The best solution would be to write the function entirely in hand-written assembly.
However, there is a simple trick that allows fast float comparisons without either type conversion or truncation:
Threshold positive (exactly as fast as int comparison):
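void example(float * pSrc, float threshold, unsigned int count)
{
    // View the same 32 bits as a signed int, an unsigned int, or a float
    typedef union {
        int ival;
        unsigned int uval;
        float fval;
    } unitype;
    unitype v, t;
    if (count==0) return;
    t.fval = threshold;
    do {
        v.fval = *pSrc++;
        // True when the value is below the (positive) threshold:
        // negative floats have the sign bit set, so they are negative ints too
        if (v.ival < t.ival) {
            // your code here
        }
        else {
            // your code here (optional)
        }
    } while (--count);
}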
Threshold negative (1 cycle more per value than int comparison):
void example(float * pSrc, float threshold, unsigned int count)
{
    // View the same 32 bits as a signed int, an unsigned int, or a float
    typedef union {
        int ival;
        unsigned int uval;
        float fval;
    } unitype;
    unitype v, t, temp;
    if (count==0) return;
    t.fval = threshold;
    t.uval &= 0x7fffffff;   // |threshold|
    do {
        v.fval = *pSrc++;
        temp.uval = v.uval ^ 0x80000000;   // -value (flip the sign bit)
        // True when -value >= |threshold|, i.e. value <= threshold
        if (temp.ival >= t.ival) {
            // your code here
        }
        else {
            // your code here (optional)
        }
    } while (--count);
}
I think this is quite a lot faster than the accepted answer above. Again, I'm a bit late.
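The two variants could be combined into one routine that picks its loop from the sign of the threshold. This is only a sketch under a few assumptions: `count_below` and `float_bits` are hypothetical names, `memcpy` replaces the union for reading the bits, a strict comparison is used in both branches, and the input is assumed to contain 32-bit IEEE 754 floats with no NaN values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read the raw bits of a float as a signed 32-bit integer
 * (hypothetical helper; assumes float is 32-bit IEEE 754). */
static int32_t float_bits(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Count the elements strictly below threshold; the per-element
 * comparisons are integer compares only. */
static size_t count_below(const float *src, size_t count, float threshold)
{
    size_t hits = 0;
    size_t i;
    assert(src != NULL || count == 0);
    if (threshold > 0.0f) {
        /* Positive threshold: the raw bits compare correctly as signed ints. */
        const int32_t t = float_bits(threshold);
        for (i = 0; i < count; i++) {
            if (float_bits(src[i]) < t)
                hits++;
        }
    } else {
        /* Zero or negative threshold: compare -value against |threshold|. */
        const int32_t t = float_bits(threshold) & INT32_MAX;
        for (i = 0; i < count; i++) {
            const int32_t negated = float_bits(src[i]) ^ INT32_MIN;
            if (negated > t)
                hits++;
        }
    }
    return hits;
}
```

The one float comparison (`threshold > 0.0f`) runs once per call rather than once per element, so the loop bodies stay free of float-to-int conversions.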
Answer 4:
If rounding errors do not matter, then you should use std::lrint.
"Faster Floating Point to Integer Conversions" recommends using it for float-to-int conversion.
Source: https://stackoverflow.com/questions/10381927/efficient-floating-point-comparison-cortex-a8