Parse double precision IEEE floating-point on a C compiler with no double precision type

问题

I am working with an 8-bit AVR chip. There is no data type for a 64-bit double (double just maps to the 32-bit float). However, I will be receiving 64-bit doubles over Serial and need to output 64-bit doubles over Serial.

How can I convert the 64-bit double to a 32-bit float and back again without casting? The format for both the 32-bit and 64-bit will follow IEEE 754. Of course, I assume a loss of precision when converting to the 32-bit float.

For converting from 64-bit to 32-bit float, I am trying this out:

// Script originally from http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1281990303
float convert(uint8_t *in) {
  union {
    float real;
    uint8_t base[4];
  } u;
  uint16_t expd = ((in[7] & 127) << 4) + ((in[6] & 240) >> 4);
  uint16_t expf = expd ? (expd - 1024) + 128 : 0;
  u.base[3] = (in[7] & 128) + (expf >> 1);
  u.base[2] = ((expf & 1) << 7) + ((in[6] & 15) << 3) + ((in[5] & 0xe0) >> 5);
  u.base[1] = ((in[5] & 0x1f) << 3) + ((in[4] & 0xe0) >> 5);
  u.base[0] = ((in[4] & 0x1f) << 3) + ((in[3] & 0xe0) >> 5);
  return u.real;
}

For numbers like 1.0 and 2.0, the above works, but when I tested with passing in a 1.1 as a 64-bit double, the output was off by a bit (literally, not a pun!), though this could be an issue with my testing. See:

// Comparison of bits for a float in Java and the bits for a float in C after
// converted from a 64-bit double. Last bit is different.
// Java code can be found at https://gist.github.com/912636
JAVA FLOAT:        00111111 10001100 11001100 11001101
C CONVERTED FLOAT: 00111111 10001100 11001100 11001100

回答1:

IEEE specifies five different rounding modes, but the one to use by default is Round half to even. So you have a mantissa of the form 10001100 11001100 11001100 11001100... and you have to round it to 24 bits. Numbering the bits from 0 (most significant), bit 24 is 1; but that is not enough to tell you whether to round bit 23 up or not. If all the remaining bits were 0, you would not round up, because bit 23 is 0 (even). But the remaining bits are not zero, so you round up in all cases.

Some examples:

10001100 11001100 11001100 10000000...(all zero) doesn't round up, because bit 23 is already even.

10001100 11001100 11001101 10000000...(all zero) does round up, because bit 23 is odd.

10001100 11001100 1100110x 10000000...0001 always rounds up, because the remaining bits are not all zero.

10001100 11001100 1100110x 0xxxxxxx... never rounds up, because bit 24 is zero.

回答2:

The following code seems to convert from single precision to double precision. I'll leave it as an exercise to the reader to implement the narrowing version. This should get you started though. The hardest part is getting the bit positions correct in the significand. I includes some comments that include what is going on.

double
extend_float(float f)
{
    unsigned char flt_bits[sizeof(float)];
    unsigned char dbl_bits[sizeof(double)] = {0};
    unsigned char sign_bit;
    unsigned char exponent;
    unsigned int  significand;
    double out;

    memcpy(&flt_bits[0], &f, sizeof(flt_bits));
    /// printf("---------------------------------------\n");
    /// printf("float = %f\n", f);
#if LITTLE_ENDIAN
    reverse_bytes(flt_bits, sizeof(flt_bits));
#endif
    /// dump_bits(&flt_bits[0], sizeof(flt_bits));

    /* IEEE 754 single precision
     *    1 sign bit              flt_bits[0] & 0x80
     *    8 exponent bits         flt_bits[0] & 0x7F | flt_bits[1] & 0x80
     *   23 fractional bits       flt_bits[1] & 0x7F | flt_bits[2] & 0xFF |
     *                            flt_bits[3] & 0xFF
     *
     * E = 0   & F  = 0 -> +/- zero
     * E = 0   & F != 0 -> sub-normal
     * E = 127 & F  = 0 -> +/- INF
     * E = 127 & F != 0 -> NaN
     */
    sign_bit = (flt_bits[0] & 0x80) >> 7;
    exponent = ((flt_bits[0] & 0x7F) << 1) | ((flt_bits[1] & 0x80) >> 7);
    significand = (((flt_bits[1] & 0x7F) << 16) |
                   (flt_bits[2] << 8) |
                   (flt_bits[3]));

    /* IEEE 754 double precision
     *    1 sign bit              dbl_bits[0] & 0x80
     *   11 exponent bits         dbl_bits[0] & 0x7F | dbl_bits[1] & 0xF0
     *   52 fractional bits       dbl_bits[1] & 0x0F | dbl_bits[2] & 0xFF
     *                            dbl_bits[3] & 0xFF | dbl_bits[4] & 0xFF
     *                            dbl_bits[5] & 0xFF | dbl_bits[6] & 0xFF
     *                            dbl_bits[7] & 0xFF
     *
     * E = 0    & F  = 0 -> +/- zero
     * E = 0    & F != 0 -> sub-normal
     * E = x7FF & F  = 0 -> +/- INF
     * E = x7FF & F != 0 -> NaN
     */
    dbl_bits[0] = flt_bits[0] & 0x80; /* pass the sign bit along */

    if (exponent == 0) {
        if (significand  == 0) { /* +/- zero */
            /* nothing left to do for the outgoing double */
        } else { /* sub-normal number */
            /* not sure ... pass on the significand?? */
        }
    } else if (exponent == 0xFF) { /* +/-INF and NaN */
        dbl_bits[0] |= 0x7F;
        dbl_bits[1]  = 0xF0;
        /* pass on the significand */
    } else { /* normal number */
        signed int int_exp = exponent;
        int_exp -= 127;  /* IEEE754 single precision exponent bias */
        int_exp += 1023; /* IEEE754 double precision exponent bias */
        dbl_bits[0] |= (int_exp & 0x7F0) >> 4;  /* 7 bits */
        dbl_bits[1]  = (int_exp & 0x00F) << 4;  /* 4 bits */
    }

    if (significand != 0) {
        /* pass on the significand most-significant-bit first */
        dbl_bits[1] |=  (flt_bits[1] & 0x78) >> 3;    /* 4 bits */
        dbl_bits[2] = (((flt_bits[1] & 0x07) << 5) |  /* 3 bits */
                       ((flt_bits[2] & 0xF8) >> 3));  /* 5 bits */
        dbl_bits[3] = (((flt_bits[2] & 0x07) << 5) |  /* 3 bits */
                       ((flt_bits[3] & 0xF8) >> 3));  /* 5 bits */
        dbl_bits[4] =  ((flt_bits[3] & 0x07) << 5);   /* 3 bits */
    }

    ///dump_bits(&dbl_bits[0], sizeof(dbl_bits));
#if LITTLE_ENDIAN
    reverse_bytes(&dbl_bits[0], sizeof(dbl_bits));
#endif
    memcpy(&out, &dbl_bits[0], sizeof(out));

    return out;
}

I left some printf lines in but commented out in C++ style comments. You will have to provide the appropriate definitions for reverse_bytes, LITTLE_ENDIAN, and dump_bits. I didn't want to spoil all of the fun for you afterall. The Wikipedia entries on single precision and double precision numbers are very good.

If you are going to be tinkering with floating point numbers a lot, you should read "What Every Computer Scientist Should Know About Floating Point Arithmetic" by David Goldberg and "How to Print Floating Point Numbers Accurately" by Steele and White. They are the two most informative articles out there when it comes to understanding how floating point numbers work.

回答3:

http://www.google.com/search?q=c+convert+ieee+754+double+single

one of the first results is this:

http://www.mathworks.com/matlabcentral/fileexchange/23173

The code shows how to convert IEEE-754 double into IEEE-754-like (1,5,10) floating format. This code contains lots of comments, and mentions the typical traps that you might fall into.

It's not exactly what you want, but it's a good starting point.

回答4:

There is only one full implementation of IEEE754 double in GCC for AVR that I am aware of, and you can find it here.

You will need this archive and then replace avr_f64.c from the archive with this one.

Library takes about 21K Flash and 310 bytes of ram.

Original post can be found here. I have extracted all important information from the original post and presented here since I think that you need to have an account to log in the forum.

来源：https://stackoverflow.com/questions/5614222/parse-double-precision-ieee-floating-point-on-a-c-compiler-with-no-double-precis

标签

c++

casting

avr