fixed-point

Notation for fixed point representation

一曲冷凌霜 posted on 2019-12-07 09:04:50
Question: I'm looking for a commonly understood notation for defining a fixed-point number representation. The notation should be able to express both a power-of-two factor (using fractional bits) and a generic factor (which I'm sometimes forced to use, though it is less efficient). It should also allow an optional offset. I already know some possible notations, but all of them seem to be constrained to specific applications. For example the Simulink notation would perfectly fit my needs, but it's known
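As an illustration of what such a notation has to capture, here is a minimal sketch of a slope/bias descriptor in which `real_value = stored_integer * slope + bias`. The class name `FixedFormat` and its fields are hypothetical and are not taken from Simulink or any other tool; a power-of-two factor is simply the special case `slope = 2**-frac_bits`.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class FixedFormat:
    """Hypothetical descriptor: real_value = stored_integer * slope + bias."""
    word_bits: int
    signed: bool
    slope: Fraction               # 2**-frac_bits, or any generic factor
    bias: Fraction = Fraction(0)  # optional offset

    @classmethod
    def binary(cls, word_bits, frac_bits, signed=True):
        # Power-of-two scaling expressed as a slope.
        return cls(word_bits, signed, Fraction(2) ** -frac_bits)

    def encode(self, real):
        # Round to the nearest representable stored integer.
        raw = round((Fraction(real) - self.bias) / self.slope)
        lo = -(1 << (self.word_bits - 1)) if self.signed else 0
        hi = lo + (1 << self.word_bits) - 1
        if not lo <= raw <= hi:
            raise OverflowError(f"{real} does not fit in {self}")
        return raw

    def decode(self, raw):
        return raw * self.slope + self.bias


# A signed 16-bit word with 8 fractional bits.
q8_8 = FixedFormat.binary(16, 8)
# An unsigned 8-bit word with a generic factor of 0.5 and an offset of -40.
sensor = FixedFormat(8, False, Fraction(1, 2), Fraction(-40))
```

Under these assumptions four pieces of information (word size, signedness, factor, offset) fully describe a format, which is the minimum any textual notation would need to encode.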

C++: Emulated Fixed Point Division/Multiplication

白昼怎懂夜的黑 posted on 2019-12-07 02:46:24
I'm writing a Fixedpoint class, but have run into a bit of a snag... I am not sure how to emulate the multiplication and division portions. I took a very rough stab at the division operator, but I am sure it's wrong. Here's what it looks like so far: class Fixed { Fixed(short int _value, short int _part) : value(long(_value + (_part >> 8))), part(long(_part & 0x0000FFFF)) {}; ... inline Fixed operator -() const // example of some of the bitwise it's doing { return Fixed(-value - 1, (~part)&0x0000FFFF); }; ... inline Fixed operator / (const Fixed & arg) const // example of how I'm probably doing it
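For comparison, a common way to emulate these two operators is to keep the whole number in a single scaled integer instead of separate `value` and `part` fields. The sketch below assumes a Q16.16 layout; the names `FRAC_BITS`, `fixed_mul` and `fixed_div` are illustrative and are not part of the asker's class.

```python
FRAC_BITS = 16          # assumed Q16.16 layout, not the asker's value/part split
ONE = 1 << FRAC_BITS    # raw representation of 1.0


def to_fixed(x: float) -> int:
    return round(x * ONE)


def to_float(raw: int) -> float:
    return raw / ONE


def fixed_mul(a: int, b: int) -> int:
    # The raw product carries 2 * FRAC_BITS fractional bits, so shift back down.
    # Python's >> floors; C++ integer division would truncate toward zero instead.
    return (a * b) >> FRAC_BITS


def fixed_div(a: int, b: int) -> int:
    # Pre-scale the dividend so the quotient keeps FRAC_BITS fractional bits.
    return (a << FRAC_BITS) // b
```

Python integers never overflow, so the intermediate product and the pre-scaled dividend are exact here. A C++ port would need an intermediate type twice as wide as the stored value for both operations.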

Fastest way to multiply two 64-bit ints to 128-bit then >> to 64-bit? [duplicate]

江枫思渺然 posted on 2019-12-06 07:34:47
Question: This question already has answers here: Computing high 64 bits of a 64x64 int product in C (5 answers). Closed 2 years ago. I need to multiply two signed 64-bit integers a and b together, then shift the (128-bit) result to a signed 64-bit integer. What's the fastest way to do that? My 64-bit integers actually represent fixed-point numbers with fmt fractional bits. fmt is chosen so that a * b >> fmt should not overflow, for instance abs(a) < 64<<fmt and abs(b) < 2<<fmt with fmt==56 would

64-bit fixed-point multiplication error

老子叫甜甜 posted on 2019-12-06 03:57:44
Question: I'm implementing a 64-bit fixed-point signed 31.32 numeric type in C#, based on long. So far so good for addition and subtraction. Multiplication, however, has an annoying case I'm trying to solve. My current algorithm consists of splitting each operand into its most and least significant 32 bits, performing 4 multiplications into 4 longs and adding the relevant bits of these longs. Here it is in code: public static Fix64 operator *(Fix64 x, Fix64 y) { var xl = x.m_rawValue; // underlying long
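The split-and-recombine scheme described in the question can be modelled as follows. This is a Python sketch of the general technique, not a reconstruction of the truncated C# code; the function name `fix64_mul` is illustrative, and 64-bit overflow of the final result is deliberately not modelled.

```python
FRAC = 32
MASK32 = 0xFFFFFFFF


def fix64_mul(x: int, y: int) -> int:
    """Multiply two Q31.32 raw values using four 32x32 partial products."""
    xlo, xhi = x & MASK32, x >> FRAC   # low half unsigned, high half signed
    ylo, yhi = y & MASK32, y >> FRAC

    lolo = xlo * ylo   # weight 2**0  (needs an unsigned 64-bit type in C#)
    lohi = xlo * yhi   # weight 2**32
    hilo = xhi * ylo   # weight 2**32
    hihi = xhi * yhi   # weight 2**64

    # Drop the 32 extra fractional bits: only lolo loses information.
    return (lolo >> FRAC) + lohi + hilo + (hihi << FRAC)
```

Because `x = xhi * 2**32 + xlo` holds exactly when the high half comes from an arithmetic shift, the sum above equals `(x * y) >> 32` for every pair of inputs. One detail a fixed-width port has to handle is that `lolo` can reach `(2**32 - 1)**2`, which exceeds the range of a signed 64-bit `long`.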

Add saturate 32-bit signed ints intrinsics?

隐身守侯 posted on 2019-12-06 03:05:45
Can someone recommend a fast way to do a saturating add of 32-bit signed integers using Intel intrinsics (AVX, SSE4, ...)? I looked at the intrinsics guide and found _mm256_adds_epi16, but this seems to only add 16-bit ints. I don't see anything similar for 32 bits. The other calls seem to wrap around. A signed overflow will happen if (and only if): the signs of both inputs are the same, and the sign of the sum (when added with wrap-around) is different from the sign of the inputs. Using C operators: overflow = ~(a^b) & (a^(a+b)). Also, if an overflow happens, the saturated result will have the same sign as either
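The overflow test quoted above can be checked with a scalar model before it is vectorised. The sketch below emulates 32-bit wrap-around in Python; `wrap32` and `adds32` are hypothetical helper names, and no SIMD intrinsics are used.

```python
MASK32 = 0xFFFFFFFF
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def wrap32(x: int) -> int:
    """Reduce an integer to signed 32-bit two's complement (wrap-around)."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def adds32(a: int, b: int) -> int:
    """Saturating add of two values already in the int32 range."""
    s = wrap32(a + b)
    # Bit 31 is set iff a and b share a sign and the wrapped sum does not.
    overflow = ~(a ^ b) & (a ^ s)
    if (overflow >> 31) & 1:
        # Saturate toward the sign of the inputs.
        return INT32_MAX if a >= 0 else INT32_MIN
    return s
```

Only the sign bit of the `overflow` expression is meaningful, so a vector version would typically broadcast that bit into a per-lane mask and use it to select between the wrapped sum and the saturation constant.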

Adding Library to VHDL Project

∥☆過路亽.° posted on 2019-12-05 07:42:45
Question: I am trying to use fixed point numbers in my VHDL project, but I keep having trouble implementing the library (found here http://www.eda-stds.org/fphdl/fixed_pkg_c.vhdl). The error I receive when trying to simulate is this: <ufixed> is not declared. My question is how exactly should a library be implemented so it can be used? As of now I have added it to the project in the IEEE_PROPOSED library, but it is not working. All source code can be found here https://github.com/srohrer32/beamformer

Avoid Overflow when Calculating π by Evaluating a Series Using 16-bit Arithmetic?

£可爱£侵袭症+ posted on 2019-12-04 23:08:19
I'm trying to write a program that calculates decimal digits of π to 1000 digits or more. To practice low-level programming for fun, the final program will be written in assembly, on an 8-bit CPU that has no multiplication or division, and only performs 16-bit additions. To ease the implementation, it's desirable to be able to use only 16-bit unsigned integer operations, and use an iterative algorithm. Speed is not a major concern. Fast multiplication and division are beyond the scope of this question, so don't consider those issues either. Before implementing it in assembly, I'm still

Fastest way to multiply two 64-bit ints to 128-bit then >> to 64-bit? [duplicate]

六眼飞鱼酱① posted on 2019-12-04 11:55:01
This question already has an answer here: Computing high 64 bits of a 64x64 int product in C (5 answers). I need to multiply two signed 64-bit integers a and b together, then shift the (128-bit) result to a signed 64-bit integer. What's the fastest way to do that? My 64-bit integers actually represent fixed-point numbers with fmt fractional bits. fmt is chosen so that a * b >> fmt should not overflow, for instance abs(a) < 64<<fmt and abs(b) < 2<<fmt with fmt==56 would never overflow in 64 bits as the final result would be < 128<<fmt and therefore fit in an int64. The reason I want to do that is
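When the 128-bit product is only available as a high and a low 64-bit word (for example from a widening multiply), the shift can be done by recombining the two halves. The following Python sketch models that recombination; `mul_shift` is an illustrative name, and Python's unbounded integers stand in for the 128-bit product.

```python
MASK64 = (1 << 64) - 1
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def mul_shift(a: int, b: int, fmt: int) -> int:
    """Return (a * b) >> fmt, built from the high and low 64-bit words."""
    if not 0 <= fmt <= 64:
        raise ValueError("fmt must be between 0 and 64")
    product = a * b            # exact; stands in for the 128-bit product
    lo = product & MASK64      # low word, treated as unsigned
    hi = product >> 64         # high word, signed (arithmetic shift)

    # The high word supplies the top bits, the low word the remaining ones.
    result = (hi << (64 - fmt)) | (lo >> fmt)
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("result does not fit in a signed 64-bit integer")
    return result
```

With the bounds from the question (`fmt == 56`, `abs(a) < 64<<fmt`, `abs(b) < 2<<fmt`) the range check never triggers, which matches the claim that the result stays below `128<<fmt`.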

How to use expr on float?

♀尐吖头ヾ posted on 2019-12-04 08:16:42
Question: I know it's a really stupid question, but I don't know how to do this in bash: 20 / 30 * 100. It should be 66.67, but expr is saying 0, because it doesn't support float. What command in Linux can replace expr and do this calculation? Answer 1: As reported in the bash man page: The shell allows arithmetic expressions to be evaluated, under certain circumstances...Evaluation is done in fixed-width integers with no check for overflow, though division by 0 is trapped and flagged as an error. You can
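The zero comes from integer division happening first (20 / 30 truncates to 0). The same result can be obtained with integers only by scaling before dividing, which is the fixed-point idea this tag is about. The sketch below shows the reasoning in Python rather than shell, using `//` to mimic integer division; `percent_2dp` is a hypothetical helper and assumes non-negative inputs.

```python
def percent_2dp(numerator: int, denominator: int) -> str:
    """Return numerator / denominator * 100 with two decimals, integers only."""
    # Scale by 100 for the percentage and by another 100 for two decimal places,
    # then add half the divisor so the integer division rounds to nearest.
    scaled = (numerator * 100 * 100 + denominator // 2) // denominator
    return f"{scaled // 100}.{scaled % 100:02d}"


# Dividing first loses everything; multiplying first keeps the integer part.
truncated_first = 20 // 30 * 100    # 0
multiplied_first = 20 * 100 // 30   # 66
```

Here `percent_2dp(20, 30)` yields the string `66.67`. The same ordering of operations (multiply by the scale, then divide) carries over to any integer-only evaluator.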

64-bit fixed-point multiplication error

北慕城南 posted on 2019-12-04 07:29:40
I'm implementing a 64-bit fixed-point signed 31.32 numeric type in C#, based on long. So far so good for addition and subtraction. Multiplication, however, has an annoying case I'm trying to solve. My current algorithm consists of splitting each operand into its most and least significant 32 bits, performing 4 multiplications into 4 longs and adding the relevant bits of these longs. Here it is in code: public static Fix64 operator *(Fix64 x, Fix64 y) { var xl = x.m_rawValue; // underlying long of x var yl = y.m_rawValue; // underlying long of y var xlow = xl & 0x00000000FFFFFFFF; // take the 32