Converting SIGNED fractions to UNSIGNED fixed point for addition and multiplication

放肆的年华 提交于 2019-12-13 09:07:47

问题


How can we convert floating point numbers to their "fixed-point representations", and use their "fixed-point representations" in fixed-point operations such as addition and multiplication? The result in the fixed-point operation must yield to the correct answer when converted back to floating point.

Say:

(double)(xb_double) + (double)(xb_double) = ?

Then we convert both addends to a fixed point representation (integer),

(int)(xa_fixed) + (int)(xb_fixed) = (int) (xsum_fixed)

To get (double)(xsum_double), we convert (int)(sum_fixed) back to floating point and yield same answer,

FixedToDouble(xsum_fixed) => xsum_double

Specifically, if the range of the values of xa_double and xb_double is between -1.65 and 1.65, I want to convert xa_double and xb_double in their respective 10-bit fixed point representations (0x0000 to 0x03FF)

WHAT I HAVE TRIED

int fixed_MAX = 1023;
int fixed_MIN = 0;
double Value_MAX = 1.65;
double Value_MIN = -1.65;

double slope = ((fixed_MAX) - (fixed_MIN))/((Value_MAX) - (Value_MIN));

int DoubleToFixed(double x)
{
return round(((x) - Value_MIN)*slope + fixed_MIN); //via interpolation method
}

double FixedToDouble(int x)
{
return (double)((((x) + fixed_MIN)/slope) + Value_MIN);
}

int sum_fixed(int x, int y)
{
    return (x + y - (1.65*slope)); //analysis, just basic math
}

int subtract_fixed(int x, int y)
{
    return (x - y + (1.65*slope));
}

int product_fixed(int x, int y)
{
    return (((x * y) - (slope*slope*((1.65*FixedToDouble(x)) + (1.65*FixedToDouble(y)) + (1.65*1.65))) + (slope*slope*1.65)) / slope);
}

And if I want to add (double)(1.00) + (double)(2.00) = which should yield to (double)(3.00),

With my code,

xsum_fixed = DoubleToFixed(1.00) + DoubleToFixed(2.00);
xsum_double = FixedToDouble(xsum_fixed);

I get the answer:

xsum_double = 3.001613

Which is very close to the correct answer (double)(3.00)

Also, if I perform multiplication and subtraction I get 2.004839 and -1.001613, respectively.

HERE'S THE CATCH:

So I know my code is working, but how can I perform addition, multiplication and subtraction on these fixed-point representations without having INTERNAL FLOATING POINT OPERATIONS AND NUMBERS.

So in the code above, the functions sum_fixed, product_fixed, and subtract_fixed have internal floating point numbers (slope and 1.65, 1.65 being the MAX float input). I derived my code by basic math, really.

So I want to implement add, subtract, and product functions without any internal floating point operations or numbers.

UPDATE:

I also found a simpler code in converting fractional numbers to fixed-point:

//const int scale = 16; //1/2^16 in 32 bits

#define DoubleToFixed(x) (int)((x) * (double)(1<<scale))
#define FixedToDouble(x) ((double)(x) / (double)(1<<scale))
#define FractionPart(x) ((x) & FractionMask)

#define MUL(x,y) (((long long)(x)*(long long)(y)) >> scale)
#define DIV(x, y) (((long long)(x)<<16)/(y)) 

However, this converts only UNSIGNED fractions to UNSIGNED fixed-point. And I want to convert SIGNED fractions (-1.65 to 1.65) to UNSIGNED fixed-point (0x0000 to 0x03FF). How can I do this with the use of this code above? Is the range or number of bits have something to do with the conversion process? Is this code only for positive fractions?

credits to @chux


回答1:


You can have the mantissa of the floating point representation of your number be equal to its fixed point representation. Since FP addition shifts the smaller operand's mantissa until both operands have the same exponent, you can add a certain 'magic number' to force it. For double, it's 1<<(52-precision) (52 is double's mantissa size, 'precision' is the required number of binary precision digits). So the conversion would look like this:

union { double f; long long i; } u = { xfloat+(1ll<<52-precision) }; // shift x's mantissa
long long xfixed = u.i & (1ll<<52)-1; // extract the mantissa

After that you can use xfixed in integer math (for multiplication, you'd have to shift the result right by 'precision'). To convert it back to double, simply multiply it by 1.0/(1 << precision);

Note that it doesn't handle negatives. If you need them, you'd have to convert them to the complementary representation manually (first fabs the double, then negate the int result if the input was negative).



来源:https://stackoverflow.com/questions/34125700/converting-signed-fractions-to-unsigned-fixed-point-for-addition-and-multiplicat

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!