Converting SIGNED fractions to UNSIGNED fixed point for addition and multiplication

Question

How can we convert floating-point numbers to their "fixed-point representations", and then use those representations in fixed-point operations such as addition and multiplication? The result of the fixed-point operation must yield the correct answer when converted back to floating point.

Say:

(double)(xa_double) + (double)(xb_double) = ?

Then we convert both addends to a fixed point representation (integer),

(int)(xa_fixed) + (int)(xb_fixed) = (int) (xsum_fixed)

To get (double)(xsum_double), we convert (int)(xsum_fixed) back to floating point, which should yield the same answer:

FixedToDouble(xsum_fixed) => xsum_double

Specifically, if the values of xa_double and xb_double lie between -1.65 and 1.65, I want to convert xa_double and xb_double to their respective 10-bit fixed-point representations (0x0000 to 0x03FF).

WHAT I HAVE TRIED

int fixed_MAX = 1023;
int fixed_MIN = 0;
double Value_MAX = 1.65;
double Value_MIN = -1.65;

double slope = ((fixed_MAX) - (fixed_MIN))/((Value_MAX) - (Value_MIN));

int DoubleToFixed(double x)
{
    return round(((x) - Value_MIN)*slope + fixed_MIN); //via interpolation method (round() needs <math.h>)
}

double FixedToDouble(int x)
{
    return (double)((((x) - fixed_MIN)/slope) + Value_MIN); //inverse of DoubleToFixed
}

int sum_fixed(int x, int y)
{
    return (x + y - (1.65*slope)); //analysis, just basic math
}

int subtract_fixed(int x, int y)
{
    return (x - y + (1.65*slope));
}

int product_fixed(int x, int y)
{
    return (((x * y) - (slope*slope*((1.65*FixedToDouble(x)) + (1.65*FixedToDouble(y)) + (1.65*1.65))) + (slope*slope*1.65)) / slope);
}

And if I want to add (double)(1.00) + (double)(2.00), which should yield (double)(3.00),

With my code,

xsum_fixed = sum_fixed(DoubleToFixed(1.00), DoubleToFixed(2.00));
xsum_double = FixedToDouble(xsum_fixed);

I get the answer:

xsum_double = 3.001613

This is very close to the correct answer, (double)(3.00).

Also, if I perform multiplication and subtraction I get 2.004839 and -1.001613, respectively.

HERE'S THE CATCH:

So I know my code is working, but how can I perform addition, multiplication and subtraction on these fixed-point representations WITHOUT INTERNAL FLOATING POINT OPERATIONS AND NUMBERS?

So in the code above, the functions sum_fixed, product_fixed, and subtract_fixed have internal floating point numbers (slope and 1.65, 1.65 being the MAX float input). I derived my code by basic math, really.
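The "basic math" behind the three functions can be written out as follows. The shorthand here is not from the original code: s stands for slope, c for 1.65, and F(v) = s(v + c) for DoubleToFixed without the rounding step (fixed_MIN is 0, so it drops out).

```latex
\begin{aligned}
F(a) + F(b) &= s\,(a + b + 2c) = F(a+b) + sc
  &&\Rightarrow\quad F(a+b) = F(a) + F(b) - sc \\
F(a) - F(b) &= s\,(a - b) = F(a-b) - sc
  &&\Rightarrow\quad F(a-b) = F(a) - F(b) + sc \\
F(a)\,F(b) &= s^{2}\,(ab + ca + cb + c^{2})
  &&\Rightarrow\quad F(ab) = s\,(ab + c)
     = \frac{F(a)\,F(b) - s^{2}\,(ca + cb + c^{2}) + s^{2}c}{s}
\end{aligned}
```

With s = 1023/3.3 = 310 and c = 1.65, the constant sc is 511.5. Because that is not an integer, sum_fixed and subtract_fixed cannot be reproduced exactly with integer constants alone.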

So I want to implement add, subtract, and product functions without any internal floating point operations or numbers.
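One possible way to get there is sketched below. It is not the code from the question: it assumes the representation may be changed slightly so that 0.0 maps to the integer code 512 instead of the fractional offset 511.5, and it saturates results to the 10-bit range. The names CODE_SCALE, CODE_BIAS, saturate, double_to_code, code_to_double, add_code, sub_code and mul_code are hypothetical. Only the two conversion functions touch floating point; the arithmetic functions use integers only.

```c
#include <stdint.h>

enum {
    CODE_MIN   = 0,
    CODE_MAX   = 1023,  /* 10-bit unsigned range, 0x0000..0x03FF */
    CODE_SCALE = 310,   /* codes per unit: 1023 / 3.3 */
    CODE_BIAS  = 512    /* code for 0.0 (assumption: the question's offset is 511.5) */
};

/* Clamp a result into the 10-bit code range. */
static int saturate(int32_t code)
{
    if (code < CODE_MIN) return CODE_MIN;
    if (code > CODE_MAX) return CODE_MAX;
    return (int)code;
}

/* Boundary conversion: the only place floating point is used. */
static int double_to_code(double v)
{
    double scaled = v * CODE_SCALE;
    int32_t r = (int32_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
    return saturate(r + CODE_BIAS);
}

static double code_to_double(int code)
{
    return (double)(code - CODE_BIAS) / CODE_SCALE;
}

/* Integer-only arithmetic: remove the bias, operate, put one bias back. */
static int add_code(int x, int y)
{
    return saturate((int32_t)x + y - CODE_BIAS);
}

static int sub_code(int x, int y)
{
    return saturate((int32_t)x - y + CODE_BIAS);
}

static int mul_code(int x, int y)
{
    int32_t p = (int32_t)(x - CODE_BIAS) * (y - CODE_BIAS); /* scaled by CODE_SCALE^2 */
    int32_t half = CODE_SCALE / 2;
    int32_t q = (p >= 0 ? p + half : p - half) / CODE_SCALE; /* rounded division */
    return saturate(q + CODE_BIAS);
}
```

For example, double_to_code(1.0) is 822 and double_to_code(0.5) is 667; add_code(822, 667) gives 977, which converts back to 1.5, and mul_code(822, 667) gives 667, which converts back to 0.5. Results outside -1.65..1.65, such as 1.0 + 1.0, are clamped rather than wrapped.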

UPDATE:

I also found simpler code for converting fractional numbers to fixed-point:

const int scale = 16; // 16 fractional bits (resolution 1/2^16) in a 32-bit int
#define FractionMask ((1 << scale) - 1) // low 'scale' bits hold the fraction

#define DoubleToFixed(x) (int)((x) * (double)(1<<scale))
#define FixedToDouble(x) ((double)(x) / (double)(1<<scale))
#define FractionPart(x) ((x) & FractionMask)

#define MUL(x,y) (((long long)(x)*(long long)(y)) >> scale)
#define DIV(x, y) (((long long)(x)<<scale)/(y))

However, this converts only UNSIGNED fractions to UNSIGNED fixed-point, and I want to convert SIGNED fractions (-1.65 to 1.65) to UNSIGNED fixed-point (0x0000 to 0x03FF). How can I do this using the code above? Does the range or the number of bits have something to do with the conversion process? Is this code only for positive fractions?
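Whether the scaling approach is limited to positive fractions can be checked directly. The sketch below restates the same formulas as functions with hypothetical names (q16_from_double, q16_to_double, q16_mul) and feeds them negative inputs. It divides by 2^16 instead of right-shifting, as an assumption to avoid relying on implementation-defined shifts of negative values; the result is a signed fixed-point value, not yet the unsigned 10-bit code asked for.

```c
#include <stdint.h>

enum { Q16_SCALE = 16 }; /* 16 fractional bits */

/* Same scaling as DoubleToFixed: multiply by 2^16 and truncate. */
static int32_t q16_from_double(double x)
{
    return (int32_t)(x * (double)(1 << Q16_SCALE));
}

/* Same scaling as FixedToDouble: divide by 2^16. */
static double q16_to_double(int32_t x)
{
    return (double)x / (double)(1 << Q16_SCALE);
}

/* Same idea as MUL: widen, multiply, drop 16 fractional bits. */
static int32_t q16_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) / (1 << Q16_SCALE));
}
```

With these, q16_from_double(-1.5) is -98304, adding q16_from_double(0.25) and converting back gives -1.25, and q16_mul of -1.5 and 0.5 converts back to -0.75. Negative inputs therefore survive the round trip as signed integers; mapping them to an unsigned range needs an additional offset.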

credits to @chux

Answer 1:

You can make the mantissa of the floating-point representation of your number equal to its fixed-point representation. Floating-point addition shifts the smaller operand's mantissa until both operands have the same exponent, so adding a certain 'magic number' forces that alignment. For double, the magic number is 1<<(52-precision), where 52 is the size of double's mantissa and 'precision' is the required number of binary fraction digits. The conversion would look like this:

union { double f; long long i; } u = { xfloat + (1ll << (52 - precision)) }; // shift x's mantissa
long long xfixed = u.i & ((1ll << 52) - 1); // extract the mantissa

After that you can use xfixed in integer math (for multiplication, you'd have to shift the result right by 'precision'). To convert it back to double, simply multiply it by 1.0/(1 << precision).
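The conversion, the multiplication and the conversion back can be put together as in the sketch below. It assumes IEEE 754 binary64 doubles, the default round-to-nearest mode, and a non-negative input smaller than 2^(52-precision). The function names are hypothetical, and memcpy is used in place of the union to read the bit pattern.

```c
#include <stdint.h>
#include <string.h>

/* Convert a non-negative double to fixed point with 'precision' fraction bits. */
static int64_t double_to_fixed_magic(double x, int precision)
{
    double magic = (double)(INT64_C(1) << (52 - precision));
    double shifted = x + magic;            /* aligns x's bits to 2^-precision */
    int64_t bits;
    memcpy(&bits, &shifted, sizeof bits);  /* read the raw bit pattern */
    return bits & ((INT64_C(1) << 52) - 1); /* keep only the 52 mantissa bits */
}

/* Fixed-point multiply: the raw product has 2*precision fraction bits. */
static int64_t fixed_mul(int64_t a, int64_t b, int precision)
{
    return (a * b) >> precision;
}

/* Convert back by scaling with 2^-precision. */
static double fixed_to_double(int64_t x, int precision)
{
    return (double)x * (1.0 / (double)(INT64_C(1) << precision));
}
```

With precision = 10, double_to_fixed_magic(1.25, 10) is 1280 and double_to_fixed_magic(2.0, 10) is 2048; fixed_mul(1280, 2048, 10) is 2560, which converts back to 2.5.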

Note that it doesn't handle negatives. If you need them, you'd have to convert them to the complementary representation manually (first fabs the double, then negate the int result if the input was negative).
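The sign handling described here might look like the following sketch. It assumes IEEE 754 binary64 doubles and |x| < 2^(52-precision); the names magnitude_to_fixed, signed_to_fixed and signed_fixed_to_double are hypothetical, and the absolute value is taken with a conditional rather than fabs so that no math library is needed.

```c
#include <stdint.h>
#include <string.h>

/* Magic-number conversion for a non-negative magnitude. */
static int64_t magnitude_to_fixed(double mag, int precision)
{
    double shifted = mag + (double)(INT64_C(1) << (52 - precision));
    int64_t bits;
    memcpy(&bits, &shifted, sizeof bits);
    return bits & ((INT64_C(1) << 52) - 1);
}

/* Convert the magnitude, then negate the integer if the input was negative. */
static int64_t signed_to_fixed(double x, int precision)
{
    double mag = x < 0 ? -x : x;           /* same effect as fabs(x) here */
    int64_t fixed = magnitude_to_fixed(mag, precision);
    return x < 0 ? -fixed : fixed;
}

/* Convert a signed fixed-point value back to double. */
static double signed_fixed_to_double(int64_t fixed, int precision)
{
    return (double)fixed / (double)(INT64_C(1) << precision);
}
```

With precision = 10, signed_to_fixed(-1.25, 10) is -1280 and signed_to_fixed(1.65, 10) is 1690. The result is an ordinary signed integer, so sums and differences work directly; a product still has to be scaled down by 2^precision.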

Source: https://stackoverflow.com/questions/34125700/converting-signed-fractions-to-unsigned-fixed-point-for-addition-and-multiplicat

Tags

floating-point

signal-processing

fixed-point