> We solve problems with real numbers in assembly language using the FPU. Usually we write the input and output code in C, or with ready-made functions.
This is really the same question as how to implement those functions / how they work under the hood. I'm just going to talk about input in this answer; I'm not sure what algorithms are good for float->string.
OS-provided functions let you read / write (print) characters, one at a time or in blocks. The interesting / FP-specific part of the problem is only the float->string and string->float part. Everything else is the same as for reading/printing integers (modulo calling-convention differences: floats are usually returned in FP registers).
Correctly implementing strtod (string to double) and the single-precision equivalent is highly non-trivial if you want the result to always be correctly rounded to the nearest representable FP value, especially if you want it to also be efficient, and work for inputs right up to the limits of the biggest finite values that double can hold.
Once you know the details of the algorithm (in terms of looking at single digits and doing FP multiplies / divides / additions, or integer operations on the FP bit-pattern), you can of course implement it in asm for any platform you like. You used an x87 finit instruction in your example for some reason.
See http://www.exploringbinary.com/how-glibc-strtod-works/ for a detailed look at glibc's implementation, and http://www.exploringbinary.com/how-strtod-works-and-sometimes-doesnt/ for another widely-used implementation.
Outlining the first article: glibc's strtod uses extended-precision integer arithmetic. It parses the input decimal string to determine the integer part and the fractional part. e.g. 456.833e2 (scientific notation) has an integer part of 45683 and a fractional part of 0.3.
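For concreteness, here's a minimal C sketch of that first parsing pass (the function name and layout are mine, not glibc's): it collects the digits and tracks where the decimal point lands once the exponent is applied, so the integer/fraction split above falls out of one index.

```c
#include <stdlib.h>   /* atoi */

/* Sketch: split a string like "456.833e2" into its decimal digits plus
   a decimal-point position.  After the exponent is applied, the digits
   left of the point are the integer part ("45683") and the rest are
   the fraction ("3").  No sign or error handling. */
void split_decimal(const char *s, char *digits, int *point_pos)
{
    int n = 0, point = -1;
    for (; *s && *s != 'e' && *s != 'E'; s++) {
        if (*s == '.') point = n;       /* remember where the '.' was */
        else           digits[n++] = *s;
    }
    if (point < 0) point = n;           /* no '.': point is at the end */
    digits[n] = '\0';
    /* "e2" shifts the point 2 places right: 456|833 becomes 45683|3 */
    *point_pos = point + ((*s == 'e' || *s == 'E') ? atoi(s + 1) : 0);
}
```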
It converts both parts to floating point separately. The integer part is easy, because there's already hardware support for converting integers to floating point: e.g. x87 fild or SSE2 cvtsi2sd, or whatever else on other architectures. But if the integer part is larger than the maximum 64-bit integer, it's not that simple, and you need to convert a BigInteger to float/double, which hardware doesn't support.
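To illustrate the easy and hard halves of that, a hedged C sketch (function names are mine): the cast is the hardware conversion for values up to 64 bits, and a naive two-limb combination shows why anything wider needs real big-integer code.

```c
#include <stdint.h>

/* The easy case: for up to 64 bits, the cast *is* the hardware
   conversion (cvtsi2sd on x86-64 with SSE2, fild with x87). */
double int_part_small(int64_t ip) {
    return (double)ip;
}

/* Beyond 64 bits the hardware can't help directly.  A naive way to
   convert a 128-bit value held in two 64-bit limbs: convert each limb
   and combine.  This can round twice, so it is NOT correctly rounded
   in general; a real strtod uses big-integer arithmetic here. */
double int_part_big(uint64_t hi, uint64_t lo) {
    return (double)hi * 0x1p64 + (double)lo;   /* hi * 2^64 + lo */
}
```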
Note that even FLT_MAX (single precision) for IEEE binary32 float is (2 − 2^−23) × 2^127, which is just slightly below 2^128, so you could use a 128-bit integer for string->float, and if that wraps then the correct float result is +Infinity. The FLT_MAX bit pattern is 0x7f7fffff: mantissa all-ones = 1.999... with the max exponent. In decimal, it's ~3.4 × 10^38.
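You can check those claims from C; a small sanity test, assuming an IEEE754 binary32 float and the standard math library:

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* Mantissa all-ones (1.999...) with the maximum finite exponent. */
    uint32_t bits = 0x7f7fffffu;
    float f;
    memcpy(&f, &bits, sizeof f);                 /* safe type pun */
    assert(f == FLT_MAX);
    assert(f == (2.0f - 0x1p-23f) * 0x1p127f);   /* (2 - 2^-23) * 2^127 */
    assert(nextafterf(f, INFINITY) == INFINITY); /* nothing finite above */
    return 0;
}
```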
But if you didn't care about efficiency, I think you could convert each digit to a float (or index an array of already-converted float values), and do the usual total = total*10 + digit, or in this case total = total*10.0 + digit_values[digit]. FP mul / add is exact for integers up to the point where two adjacent representable values are farther apart than 1.0 (i.e. when nextafter(total, +Infinity) is total+2.0), i.e. when 1 ulp is greater than 1.0.
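A sketch of that loop in C (illustrative, not glibc's code); with a double accumulator every step is exact until the total reaches 2^53:

```c
/* Naive digit-at-a-time accumulation, as described above.  Every
   multiply and add is exact while the running total stays below 2^53
   (for double, where 1 ulp is still <= 1.0); past that, each step can
   round. */
double digits_to_fp(const char *digits) {
    static const double digit_values[10] =
        { 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0 };
    double total = 0.0;
    for (; *digits >= '0' && *digits <= '9'; digits++)
        total = total * 10.0 + digit_values[*digits - '0'];
    return total;
}
```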
Actually, to get correct rounding you need to add the small values first; otherwise each one rounds down separately, when all together they could have bumped a large value up to the next representable value.
So you can probably use the FPU for this if you do it carefully, like working in chunks of 8 digits and scaling by 10^8 or something, adding starting with the smallest. You could convert each string of 8 digits to an integer and use hardware int->float.
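Here's one way that chunked idea could look in C; a sketch under the assumptions above (8-digit chunks, smallest chunk added first), and still not guaranteed correctly rounded:

```c
#include <stdint.h>
#include <string.h>

/* Up to 8 digits always fit in a uint32_t (99999999 < 2^32), so each
   chunk converts with cheap integer ops plus one hardware int->double. */
static uint32_t chunk_to_u32(const char *p, size_t len) {
    uint32_t v = 0;
    while (len--) v = v * 10 + (uint32_t)(*p++ - '0');
    return v;
}

/* Combine the chunks scaled by powers of 10^8, least significant chunk
   first so the small contributions accumulate before the big ones.
   The multiply-adds can still round, so this isn't correctly rounded. */
double digits_to_fp_chunked(const char *digits) {
    size_t n = strlen(digits);
    double total = 0.0, scale = 1.0;
    while (n > 0) {
        size_t len = n >= 8 ? 8 : n;    /* rightmost remaining chunk */
        total += scale * (double)chunk_to_u32(digits + n - len, len);
        scale *= 1e8;                   /* next chunk is 10^8 bigger */
        n -= len;
    }
    return total;
}
```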
The fractional part is even trickier, especially if you want to avoid repeated division by 10 to get the place values, which you should avoid because it's slow and because 1/10 is not exactly representable in binary floating point, so all your place values will have rounding error if you do it the "obvious" way.
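One simple way to sidestep per-digit division, sketched in C (my own simplification, not what glibc does): read the fraction digits as one integer and divide once, so there's a single rounding step.

```c
#include <stddef.h>
#include <stdint.h>

/* Read the fraction digits as one integer and divide once by 10^n.
   Capped at 15 digits so v and p stay exactly representable in double
   (10^15 < 2^53, and powers of 10 are exact up to 10^22); then the
   only rounding is in the final division.  glibc does better still. */
double fraction_to_fp(const char *frac, size_t n) {
    uint64_t v = 0;
    double p = 1.0;
    for (size_t i = 0; i < n && i < 15; i++) {
        v = v * 10 + (uint64_t)(frac[i] - '0');
        p *= 10.0;                      /* exact: see comment above */
    }
    return (double)v / p;               /* the single rounding step */
}
```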
But if the integer part is very large, all 53 mantissa bits of the double might already be determined by the integer part. So glibc checks, and only does big-integer division to get the number of bits it needs (if any) from the fractional part.
Anyway, I highly recommend reading both articles.
BTW, see https://en.wikipedia.org/wiki/Double-precision_floating-point_format if you're not familiar with the bit patterns that IEEE754 binary64, aka double, uses to represent numbers. You don't need to be to write a simplistic implementation, but it does help to understand floating point. And with x86 SSE, you need to know where the sign bit is to implement absolute value (ANDPS) or negation (XORPS); see "Fastest way to compute absolute value using SSE". There aren't special instructions for abs or neg; you just use boolean ops to manipulate the sign bit. (Much more efficient than subtracting from zero.)
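In C with SSE intrinsics, those sign-bit tricks look like this (compilers emit andps / xorps for them):

```c
#include <immintrin.h>

/* abs and negation by masking/flipping the sign bit; works on 4 floats
   at once. */
static inline __m128 abs_ps(__m128 x) {
    /* clear bit 31 of each lane: 0x7fffffff keeps exponent + mantissa */
    return _mm_and_ps(x, _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff)));
}
static inline __m128 neg_ps(__m128 x) {
    /* flip bit 31 of each lane */
    return _mm_xor_ps(x, _mm_castsi128_ps(_mm_set1_epi32((int)0x80000000u)));
}
```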
If you don't care about being accurate to the last ULP (unit in the last place = lowest bit of the mantissa), then you can use a simpler algorithm: multiply by 10 and add, like for string -> integer, and then scale by a power of 10 at the end.
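Sketched in C (a toy, with none of a real strtod's error handling), which makes the overflow problem described below easy to see:

```c
#include <math.h>
#include <stdlib.h>

/* The simple not-correctly-rounded approach: accumulate all digits as
   if there were no decimal point, then apply one scale factor at the
   end.  The temporary "total" can overflow to +Inf (or underflow) for
   inputs that a robust strtod would handle. */
double simple_strtod(const char *s) {
    double total = 0.0;
    int frac_digits = 0, seen_point = 0;
    for (; *s; s++) {
        if (*s == '.') { seen_point = 1; continue; }
        if (*s < '0' || *s > '9') break;
        total = total * 10.0 + (*s - '0');
        frac_digits += seen_point;      /* count digits after the '.' */
    }
    int exp10 = (*s == 'e' || *s == 'E') ? atoi(s + 1) : 0;
    return total * pow(10.0, exp10 - frac_digits);
}
```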
But a robust library function can't do that, because creating a temporary value many times larger than the final result means it will overflow (to +/- Infinity) for some inputs that are within the range that double can represent. Or possibly underflow to +/- 0.0 if you create smaller temporary values.
Handling the integer and fractional part separately avoids the overflow problem.
See this C implementation on codereview.SE for an example of a very simple multiply/add approach that will probably overflow. I only skimmed it quickly, but I don't see it splitting the integer and fractional parts; it only handles scientific notation (E99 or whatever at the end) with repeated multiplication or division by 10.