Notation for fixed point representation

一曲冷凌霜 提交于 2019-12-07 09:04:50

问题


I'm looking for a commonly understandable notation to define a fixed point number representation. The notation should be able to define both a power-of-two factor (using fractional bits) and a generic factor (sometimes I'm forced to use this, though less efficient). And also an optional offset should be defined.
I already know some possible notations, but all of them seem to be constrained to specific applications.

  • For example the Simulink notation would perfectly fit my needs, but it's known only in the Simulink world. Furthermore the overloaded usage of the fixdt() function is not so readable.

  • TI defines a really compact Q Formats, but the sign is implicit, and it doesn't manage a generic factor (i.e. not a power-of-two).

  • ASAM uses a generic 6-coefficient rational function with 2nd-degree numerator and denominator polynomials (COMPU_METHOD). Very generic, but not so friendly.

See also the Wikipedia discussion.

The question is only about the notation (not efficiency of the representation nor fixed-point manipulation). So it's a matter of code readability, maintenability and testability.


回答1:


Ah, yes. Having good naming annotations is absolutely critical to not introducing bugs with fixed point arithmetic. I use an explicit version of the Q notation which handles any division between M and N by appending _Q<M>_<N> to the name of the variable. This also makes it possible to include the signedness as well. There are no run-time performance penalties for this. Example:

uint8_t length_Q2_6;                // unsigned, 2 bit integer, 6 bit fraction
int32_t sensor_calibration_Q10_21;  // signed (1 bit), 10 bit integer, 21 bit fraction.

/*
 * Calculations with the bc program (with '-l' argument):
 *
 * sqrt(3)
 * 1.73205080756887729352
 *
 * obase=16
 * sqrt(3)
 * 1.BB67AE8584CAA73B0
 */
const uint32_t SQRT_3_Q7_25 = 1 << 25 | 0xBB67AE85U >> 7; /* Unsigned shift super important here! */

In case someone have not fully understood why such annotation is extremely important, Can you spot the if there is an bug in the following two examples?

Example 1:

speed_fraction = fix32_udiv(25, speed_percent << 25, 100 << 25);
squared_speed  = fix32_umul(25, speed_fraction, speed_fraction);
tmp1 = fix32_umul(25, squared_speed, SQRT_3);
tmp2 = fix32_umul(12, tmp1 >> (25-12), motor_volt << 12);

Example 2:

speed_fraction_Q7_25 = fix32_udiv(25, speed_percent << 25, 100 << 25);
squared_speed_Q7_25  = fix32_umul(25, speed_fraction_Q7_25, speed_fraction_Q7_25);
tmp1_Q7_25  = fix32_umul(25, squared_speed_Q7_25, SQRT_3_Q1_31);
tmp2_Q20_12 = fix32_umul(12, tmp1_Q7_25 >> (25-12), motor_volt << 12);

Imagine if one file contained #define SQRT_3 (1 << 25 | 0xBB67AE85U >> 7) and another file contained #define SQRT_3 (1 << 31 | 0xBB67AE85U >> 1) and code was moved between those files. For example 1 this has a high chance of going unnoticed and introduce the bug that is present in example 2 which here is done deliberately and has a zero chance of being done accidentally.




回答2:


Actually Q format is the most used representation in commercial applications: you use is when you need to deal with fractional numbers FAST and your processor does not have a FPU (floating point unit) is it cannot use float and double data types natively - it has to emulate instructions for them which are very expensive.

usually you use Q format to represent only the fractional part, though this not a must, you get more precision for your representation. Here's what you need to consider:

  • number of bits you use (Q15 uses 16 bitdata types, usually short int)
  • the first bit is the sign bit (out of 16 bits you are left with 15 for data value)
  • the rest of the bits are used to store the fractional part of your number.
  • since you are representing fractional numbers your value is somewhere in [0,1)
  • you can choose to use some bits for the integer part as well, but you would loose precision - e.g if you wanted to represent 3.3 in Q format, you would need 1 bit for sign, 2 bits for the integer part, and are left with 13 bits for the fractional part (assuming you are using 16 bits representation)-> this format is called 2Q13

Example: Say you want to represent 0.3 in Q15 format; you apply the Rule of Three:

      1 = 2^15 = 32768 = 0x8000
    0.3 = X
    -------------
    X = 0.3*32768 = 9830 = 0x666

You lost precision by doing this but at least the computation is fast now.




回答3:


In C, you can't use a user defined type like a builtin one. If you want to do that, you need to use C++. In that language you can define a class for your fixed point type, overload all the arithmetic operators (+, -, *, /, %, +=, -=, *=, /=, %=, --, ++, cast to other types), so that usage of the instances of this class really behave like the builtin types.

In C, you need to do what you want explicitly. There are two basic approaches.


Approach 1: Do the fixed point adjustments in the user code.
This is overhead-free, but you need to remember to do the correct adjustments. I believe, it is easiest to just add the number of past point bits to the end of the variable name, because the type system won't do you much good, even if you typedef'd all the point positions you use. Here is an example:

int64_t a_7 = (int64_t)(7.3*(1<<7));    //a variable with 7 past point bits
int64_t b_5 = (int64_t)(3.78*(1<<5));   //a variable with 5 past point bits
int64_t sum_7 = a_7 + (b_5 << 2);    //to add those two variables, we need to adjust the point position in b
int64_t product_12 = a_7 * b_5;    //the product produces a number with 12 past point bits

Of course, this is a lot of hassle, but at least you can easily check at every point whether the point adjustment is correct.


Approach 2: Define a struct for your fixed point numbers and encapsulate the arithmetic on it in a bunch of functions. Like this:

typedef struct FixedPoint {
    int64_t data;
    uint8_t pointPosition;
} FixedPoint;

FixedPoint fixed_add(FixedPoint a, FixedPoint b) {
    if(a.pointPosition >= b.PointPosition) {
        return (FixedPoint){
            .data = a.data + (b.data << a.pointPosition - b.pointPosition),
            .pointPosition = a.pointPosition
        };
    } else {
        return (FixedPoint){
            .data = (a.data << b.pointPosition - a.pointPosition) + b.data,
            .pointPosition = b.pointPosition
        };
    }
}

This approach is a bit cleaner in the usage, however, it introduces significant overhead. That overhead consists of:

  1. The function calls.

  2. The copying of the structs for parameter and result passing, or the pointer dereferences if you use pointers.

  3. The need to calculate the point adjustments at runtime.

This is pretty much similar to the overhead of a C++ class without templates. Using templates would move some decisions back to compile time, at the cost of loosing flexibility.

This object based approach is probably the most flexible one, and it allows you to add support for non-binary point positions in a transparent way.



来源:https://stackoverflow.com/questions/22422557/notation-for-fixed-point-representation

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!