Floating point conversion for 8-bit floating point numbers

Question

Consider the following 8-bit (yes, 8-bit, not 8-byte) floating point representation based on the IEEE floating point format.

Format A:
There is one sign bit.
There are k=3 exponent bits.
There are n=4 fraction bits.

Format B:
There is one sign bit.
There are k=4 exponent bits.
There are n=3 fraction bits.

Below, you are given some bit patterns in format A. Your task is to find the values of the numbers they represent in format A and also convert them to the closest value in format B.

Format A                       Format B
  Bits             Value          Bits
  1 010 1000
  1 110 0000
  0 101 1010
  0 000 1001

This is homework... I don't want the assignment done for me. I just want to learn how to convert. Floating point gets me extremely confused.

Can someone just please make up a "Format A" and show me how to get the value/convert step-by-step?

Answer 1:

The question is missing many details that are important for defining a floating point format. I'm going to try to answer the first part of the question, filling in the missing information by assuming that everything unspecified follows the common rules for binary interchange formats in IEEE Std 754-2008, IEEE Standard for Floating-Point Arithmetic.

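The given parameters for Format A, in terms of Table 3.3 in the standard, are k=8 and p=5. Here k and p are the standard's parameters, not the question's: k is the storage width in bits, and p is the precision in bits (the 4 stored fraction bits plus the implicit leading bit).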

From that, and the formula in the standard, bias = emax = 2**(k - p - 1) - 1 = 3.
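The bias formula above can be checked with a few lines of Python. This is only a sketch: the function name `ieee_bias` is made up for illustration, and it assumes the format follows the IEEE 754-2008 binary interchange layout, where the exponent field is k - p bits wide.

```python
def ieee_bias(storage_bits: int, precision_bits: int) -> int:
    """Bias (equal to emax) for an IEEE 754-2008 style binary format.

    storage_bits is k and precision_bits is p in the standard's notation.
    """
    exponent_bits = storage_bits - precision_bits  # exponent field width, k - p
    return 2 ** (exponent_bits - 1) - 1


print(ieee_bias(8, 5))    # Format A: 3 exponent bits, 4 fraction bits -> 3
print(ieee_bias(8, 4))    # Format B: 4 exponent bits, 3 fraction bits -> 7
print(ieee_bias(32, 24))  # ordinary 32-bit float, as a sanity check -> 127
```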

Take the example bits 0 001 0011 (sign 0, exponent 001, fraction 0011).

The fraction is, in binary, 0011/10000, decimal 3/16 = 0.1875. The exponent bits are non-zero so it is a normal value, with a non-stored leading one bit, so the significand is 1.1875.

The exponent is the stored exponent field minus the bias: in binary, 001-011, decimal 1-3 = -2.

Multiply the significand by 2**(-2) = 1/4, giving absolute value 0.296875. Since the sign bit is zero, the absolute value is the final value.
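The same steps can be written as a small Python function. Treat it as a sketch under the same assumption as above, namely that everything unspecified follows IEEE 754 (an all-zero exponent field means zero or a subnormal, an all-ones exponent field means infinity or NaN). The name `decode_minifloat` and its parameters are invented for this example.

```python
def decode_minifloat(bits: str, exp_bits: int, frac_bits: int) -> float:
    """Decode a 'sign exponent fraction' bit string, assuming IEEE 754 rules."""
    bits = bits.replace(" ", "")
    if len(bits) != 1 + exp_bits + frac_bits:
        raise ValueError("bit string does not match the given field widths")

    sign = -1.0 if bits[0] == "1" else 1.0
    exp_field = int(bits[1:1 + exp_bits], 2)
    frac_field = int(bits[1 + exp_bits:], 2)
    bias = 2 ** (exp_bits - 1) - 1
    fraction = frac_field / 2 ** frac_bits  # e.g. 0011 -> 3/16

    if exp_field == 2 ** exp_bits - 1:
        # All-ones exponent: infinity if the fraction is zero, otherwise NaN.
        return sign * float("inf") if frac_field == 0 else float("nan")
    if exp_field == 0:
        # Zero or subnormal: no implicit leading one, exponent fixed at 1 - bias.
        return sign * fraction * 2.0 ** (1 - bias)
    # Normal value: implicit leading one, exponent is the field minus the bias.
    return sign * (1 + fraction) * 2.0 ** (exp_field - bias)


print(decode_minifloat("0 001 0011", 3, 4))  # 0.296875
```

Passing `exp_bits=4, frac_bits=3` decodes a Format B pattern with the same function, since only the field widths and the bias change.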

Source: https://stackoverflow.com/questions/18691199/floating-point-conversion-for-8-bit-floating-point-numbers

Tags

floating-point

32-bit