Fastest way to clamp a real (fixed/floating point) value?

六月ゝ 毕业季﹏ 提交于 2019-11-26 17:29:58

For the 16.16 representation, the simple ternary is unlikely to be bettered speed-wise.

And for doubles, because you need it standard/portable C, bit-fiddling of any kind will end badly.

Even if a bit-fiddle was possible (which I doubt), you'd be relying on the binary representation of doubles. THIS (and their size) IS IMPLEMENTATION-DEPENDENT.

Possibly you could "guess" this using sizeof(double) and then comparing the layout of various double values against their common binary representations, but I think you're on a hiding to nothing.

The best rule is TELL THE COMPILER WHAT YOU WANT (ie ternary), and let it optimise for you.

EDIT: Humble pie time. I just tested quinmars idea (below), and it works - if you have IEEE-754 floats. This gave a speedup of about 20% on the code below. IObviously non-portable, but I think there may be a standardised way of asking your compiler if it uses IEEE754 float formats with a #IF...?

  double FMIN = 3.13;
  double FMAX = 300.44;

  double FVAL[10] = {-100, 0.23, 1.24, 3.00, 3.5, 30.5, 50 ,100.22 ,200.22, 30000};
  uint64  Lfmin = *(uint64 *)&FMIN;
  uint64  Lfmax = *(uint64 *)&FMAX;

    DWORD start = GetTickCount();

    for (int j=0; j<10000000; ++j)
    {
        uint64 * pfvalue = (uint64 *)&FVAL[0];
        for (int i=0; i<10; ++i)
            *pfvalue++ = (*pfvalue < Lfmin) ? Lfmin : (*pfvalue > Lfmax) ? Lfmax : *pfvalue;
    }

    volatile DWORD hacktime = GetTickCount() - start;

    for (int j=0; j<10000000; ++j)
    {
        double * pfvalue = &FVAL[0];
        for (int i=0; i<10; ++i)
            *pfvalue++ = (*pfvalue < FMIN) ? FMIN : (*pfvalue > FMAX) ? FMAX : *pfvalue;
    }

    volatile DWORD normaltime = GetTickCount() - (start + hacktime);

Old question, but I was working on this problem today (with doubles/floats).

The best approach is to use SSE MINSS/MAXSS for floats and SSE2 MINSD/MAXSD for doubles. These are branchless and take one clock cycle each, and are easy to use thanks to compiler intrinsics. They confer more than an order of magnitude increase in performance compared with clamping with std::min/max.

You may find that surprising. I certainly did! Unfortunately VC++ 2010 uses simple comparisons for std::min/max even when /arch:SSE2 and /FP:fast are enabled. I can't speak for other compilers.

Here's the necessary code to do this in VC++:

#include <mmintrin.h>

float minss ( float a, float b )
{
    // Branchless SSE min.
    _mm_store_ss( &a, _mm_min_ss(_mm_set_ss(a),_mm_set_ss(b)) );
    return a;
}

float maxss ( float a, float b )
{
    // Branchless SSE max.
    _mm_store_ss( &a, _mm_max_ss(_mm_set_ss(a),_mm_set_ss(b)) );
    return a;
}

float clamp ( float val, float minval, float maxval )
{
    // Branchless SSE clamp.
    // return minss( maxss(val,minval), maxval );

    _mm_store_ss( &val, _mm_min_ss( _mm_max_ss(_mm_set_ss(val),_mm_set_ss(minval)), _mm_set_ss(maxval) ) );
    return val;
}

The double precision code is the same except with xxx_sd instead.

Edit: Initially I wrote the clamp function as commented. But looking at the assembler output I noticed that the VC++ compiler wasn't smart enough to cull the redundant move. One less instruction. :)

Jorge

Both GCC and clang generate beautiful assembly for the following simple, straightforward, portable code:

double clamp(double d, double min, double max) {
  const double t = d < min ? min : d;
  return t > max ? max : t;
}

> gcc -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c

GCC-generated assembly:

maxsd   %xmm0, %xmm1    # d, min
movapd  %xmm2, %xmm0    # max, max
minsd   %xmm1, %xmm0    # min, max
ret

> clang -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c

Clang-generated assembly:

maxsd   %xmm0, %xmm1
minsd   %xmm1, %xmm2
movaps  %xmm2, %xmm0
ret

Three instructions (not counting the ret), no branches. Excellent.

This was tested with GCC 4.7 and clang 3.2 on Ubuntu 13.04 with a Core i3 M 350. On a side note, the straightforward C++ code calling std::min and std::max generated the same assembly.

This is for doubles. And for int, both GCC and clang generate assembly with five instructions (not counting the ret) and no branches. Also excellent.

I don't currently use fixed-point, so I will not give an opinion on fixed-point.

If your processor has a fast instruction for absolute value (as the x86 does), you can do a branchless min and max which will be faster than an if statement or ternary operation.

min(a,b) = (a + b - abs(a-b)) / 2
max(a,b) = (a + b + abs(a-b)) / 2

If one of the terms is zero (as is often the case when you're clamping) the code simplifies a bit further:

max(a,0) = (a + abs(a)) / 2

When you're combining both operations you can replace the two /2 into a single /4 or *0.25 to save a step.

The following code is over 3x faster than ternary on my Athlon II X2, when using the optimization for FMIN=0.

double clamp(double value)
{
    double temp = value + FMAX - abs(value-FMAX);
#if FMIN == 0
    return (temp + abs(temp)) * 0.25;
#else
    return (temp + (2.0*FMIN) + abs(temp-(2.0*FMIN))) * 0.25;
#endif
}

Ternary operator is really the way to go, because most compilers are able to compile them into a native hardware operation that uses a conditional move instead of a branch (and thus avoids the mispredict penalty and pipeline bubbles and so on). Bit-manipulation is likely to cause a load-hit-store.

In particular, PPC and x86 with SSE2 have a hardware op that could be expressed as an intrinsic something like this:

double fsel( double a, double b, double c ) {
  return a >= 0 ? b : c; 
}

The advantage is that it does this inside the pipeline, without causing a branch. In fact, if your compiler uses the intrinsic, you can use it to implement your clamp directly:

inline double clamp ( double a, double min, double max ) 
{
   a = fsel( a - min , a, min );
   return fsel( a - max, max, a );
}

I strongly suggest you avoid bit-manipulation of doubles using integer operations. On most modern CPUs there is no direct means of moving data between double and int registers other than by taking a round trip to the dcache. This will cause a data hazard called a load-hit-store which basically empties out the CPU pipeline until the memory write has completed (usually around 40 cycles or so).

The exception to this is if the double values are already in memory and not in a register: in that case there is no danger of a load-hit-store. However your example indicates you've just calculated the double and returned it from a function which means it's likely to still be in XMM1.

The bits of IEEE 754 floating point are ordered in a way that if you compare the bits interpreted as an integer you get the same results as if you would compare them as floats directly. So if you find or know a way to clamp integers you can use it for (IEEE 754) floats as well. Sorry, I don't know a faster way.

If you have the floats stored in an arrays you can consider to use some CPU extensions like SSE3, as rkj said. You can take a look at liboil it does all the dirty work for you. Keeps your program portable and uses faster cpu instructions if possible. (I'm not sure tho how OS/compiler-independent liboil is).

Linasses

Rather than testing and branching, I normally use this format for clamping:

clampedA = fmin(fmax(a,MY_MIN),MY_MAX);

Although I have never done any performance analysis on the compiled code.

MSalters

Realistically, no decent compiler will make a difference between an if() statement and a ?: expression. The code is simple enough that they'll be able to spot the possible paths. That said, your two examples are not identical. The equivalent code using ?: would be

a = (a > MAX) ? MAX : ((a < MIN) ? MIN : a);

as that avoid the A < MIN test when a > MAX. Now that could make a difference, as the compiler otherwise would have to spot the relation between the two tests.

If clamping is rare, you can test the need to clamp with a single test:

if (abs(a - (MAX+MIN)/2) > ((MAX-MIN)/2)) ...

E.g. with MIN=6 and MAX=10, this will first shift a down by 8, then check if it lies between -2 and +2. Whether this saves anything depends a lot on the relative cost of branching.

jfs

Here's a possibly faster implementation similar to @Roddy's answer:

typedef int64_t i_t;
typedef double  f_t;

static inline
i_t i_tmin(i_t x, i_t y) {
  return (y + ((x - y) & -(x < y))); // min(x, y)
}

static inline
i_t i_tmax(i_t x, i_t y) {
  return (x - ((x - y) & -(x < y))); // max(x, y)
}

f_t clip_f_t(f_t f, f_t fmin, f_t fmax)
{
#ifndef TERNARY
  assert(sizeof(i_t) == sizeof(f_t));
  //assert(not (fmin < 0 and (f < 0 or is_negative_zero(f))));
  //XXX assume IEEE-754 compliant system (lexicographically ordered floats)
  //XXX break strict-aliasing rules
  const i_t imin = *(i_t*)&fmin;
  const i_t imax = *(i_t*)&fmax;
  const i_t i    = *(i_t*)&f;
  const i_t iclipped = i_tmin(imax, i_tmax(i, imin));

#ifndef INT_TERNARY
  return *(f_t *)&iclipped;
#else /* INT_TERNARY */
  return i < imin ? fmin : (i > imax ? fmax : f); 
#endif /* INT_TERNARY */

#else /* TERNARY */
  return fmin > f ? fmin : (fmax < f ? fmax : f);
#endif /* TERNARY */
}

See Compute the minimum (min) or maximum (max) of two integers without branching and Comparing floating point numbers

The IEEE float and double formats were designed so that the numbers are “lexicographically ordered”, which – in the words of IEEE architect William Kahan means “if two floating-point numbers in the same format are ordered ( say x < y ), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers.”

A test program:

/** gcc -std=c99 -fno-strict-aliasing -O2 -lm -Wall *.c -o clip_double && clip_double */
#include <assert.h> 
#include <iso646.h>  // not, and
#include <math.h>    // isnan()
#include <stdbool.h> // bool
#include <stdint.h>  // int64_t
#include <stdio.h>

static 
bool is_negative_zero(f_t x) 
{
  return x == 0 and 1/x < 0;
}

static inline 
f_t range(f_t low, f_t f, f_t hi) 
{
  return fmax(low, fmin(f, hi));
}

static const f_t END = 0./0.;

#define TOSTR(f, fmin, fmax, ff) ((f) == (fmin) ? "min" :       \
                  ((f) == (fmax) ? "max" :      \
                   (is_negative_zero(ff) ? "-0.":   \
                    ((f) == (ff) ? "f" : #f))))

static int test(f_t p[], f_t fmin, f_t fmax, f_t (*fun)(f_t, f_t, f_t)) 
{
  assert(isnan(END));
  int failed_count = 0;
  for ( ; ; ++p) {
    const f_t clipped  = fun(*p, fmin, fmax), expected = range(fmin, *p, fmax);
    if(clipped != expected and not (isnan(clipped) and isnan(expected))) {
      failed_count++;
      fprintf(stderr, "error: got: %s, expected: %s\t(min=%g, max=%g, f=%g)\n", 
          TOSTR(clipped,  fmin, fmax, *p), 
          TOSTR(expected, fmin, fmax, *p), fmin, fmax, *p);
    }
    if (isnan(*p))
      break;
  }
  return failed_count;
}  

int main(void)
{
  int failed_count = 0;
  f_t arr[] = { -0., -1./0., 0., 1./0., 1., -1., 2, 
        2.1, -2.1, -0.1, END};
  f_t minmax[][2] = { -1, 1,  // min, max
               0, 2, };

  for (int i = 0; i < (sizeof(minmax) / sizeof(*minmax)); ++i) 
    failed_count += test(arr, minmax[i][0], minmax[i][1], clip_f_t);      

  return failed_count & 0xFF;
}

In console:

$ gcc -std=c99 -fno-strict-aliasing -O2 -lm *.c -o clip_double && ./clip_double 

It prints:

error: got: min, expected: -0.  (min=-1, max=1, f=0)
error: got: f, expected: min    (min=-1, max=1, f=-1.#INF)
error: got: f, expected: min    (min=-1, max=1, f=-2.1)
error: got: min, expected: f    (min=-1, max=1, f=-0.1)

I tried the SSE approach to this myself, and the assembly output looked quite a bit cleaner, so I was encouraged at first, but after timing it thousands of times, it was actually quite a bit slower. It does indeed look like the VC++ compiler isn't smart enough to know what you're really intending, and it appears to move things back and forth between the XMM registers and memory when it shouldn't. That said, I don't know why the compiler isn't smart enough to use the SSE min/max instructions on the ternary operator when it seems to use SSE instructions for all floating point calculations anyway. On the other hand, if you're compiling for PowerPC, you can use the fsel intrinsic on the FP registers, and it's way faster.

If I understand properly, you want to limit a value "a" to a range between MY_MIN and MY_MAX. The type of "a" is a double. You did not specify the type of MY_MIN or MY_MAX.

The simple expression:

clampedA = (a > MY_MAX)? MY_MAX : (a < MY_MIN)? MY_MIN : a;

should do the trick.

I think there may be a small optimization to be made if MY_MAX and MY_MIN happen to be integers:

int b = (int)a;
clampedA = (b > MY_MAX)? (double)MY_MAX : (b < MY_MIN)? (double)MY_MIN : a;

By changing to integer comparisons, it is possible you might get a slight speed advantage.

If you want to use fast absolute value instructions, check out this snipped of code I found in minicomputer, which clamps a float to the range [0,1]

clamped = 0.5*(fabs(x)-fabs(x-1.0f) + 1.0f);

(I simplified the code a bit). We can think about it as taking two values, one reflected to be >0

fabs(x)

and the other reflected about 1.0 to be <1.0

1.0-fabs(x-1.0)

And we take the average of them. If it is in range, then both values will be the same as x, so their average will again be x. If it is out of range, then one of the values will be x, and the other will be x flipped over the "boundary" point, so their average will be precisely the boundary point.

As pointed out above, fmin/fmax functions work well (in gcc, with -ffast-math). Although gfortran has patterns to use IA instructions corresponding to max/min, g++ does not. In icc one must use instead std::min/max, because icc doesn't allow short-cutting the specification of how fmin/fmax work with non-finite operands.

My 2 cents in C++. Probably not any different than use ternary operators and hopefully no branching code is generated

template <typename T>
inline T clamp(T val, T lo, T hi) {
    return std::max(lo, std::min(hi, val));
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!