IEEE 754 Standard for Floating Point Representation of Real Numbers
There are four pieces of information to be represented:
Sign of the number
(Always the high-order bit; 0 = positive, 1 = negative.)
Magnitude of the number
(Stored in binary with the leading "1" understood. See below.)
Sign of the exponent
(Not stored as a separate bit; it is absorbed into the offset "bias" added to the exponent. See below.)
Magnitude of the exponent
(Stored as unsigned binary after the offset bias is added.)
I. Float. (32 bits)
Binary template: SEEE EEEE EMMM MMMM ... MMMM
where S = sign bit, E = exponent bits (8 of these), M = mantissa bits (23 of these).
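The template can be read straight off a 32-bit word with shifts and masks. The following is only a sketch: the function name split_float_bits is made up for illustration, and the sample word C128 0000 hex is the pattern worked out in the -10.5 example later in this section.

```python
def split_float_bits(word):
    """Split a 32-bit integer into the S, E and M fields of the template."""
    sign = (word >> 31) & 0x1        # S: the high-order bit
    exponent = (word >> 23) & 0xFF   # EEEE EEEE: the next 8 bits
    mantissa = word & 0x7FFFFF       # MMM...M: the low 23 bits
    return sign, exponent, mantissa


s, e, m = split_float_bits(0xC1280000)
print(s, format(e, "08b"), format(m, "023b"))
# 1 10000010 01010000000000000000000
```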
Process to represent:
A. Convert decimal number to binary.
B. Move radix point to 1.xxx... * 2^exp representation.
C. Now, do the work:
i. S. Assign the sign bit: positive --> 0, negative --> 1.
ii. Assign the mantissa. This is the fractional part of the 1.xxx... value from step B.
-- Ignore the leading "1". It is always 1 in this scientific notation, so there is no need to store it. The system will re-insert it later when the number is used.
-- Form the 23 MMM...MMM bits from the first 23 of the remaining "xxx..." bits above. If there are fewer than 23 of them, fill out with trailing zeros.
iii. The exponent "exp" may be positive or negative. Float is stored with excess-127 notation. That is, you ADD 127 to your exponent and store the result as a pure (unsigned) binary number. When the number is re-created for use later, 127 will be subtracted from the stored exponent. This method avoids having to use 2's complement for negative exponents.
iv. Combine in the SEEEEEEEEMMMM...MMM format.
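Steps A through C can be followed literally in a few lines of Python. This is a sketch, not how any real system does it: the name encode_float_by_hand is hypothetical, it only handles nonzero values whose exponent fits the normal float range, and it keeps the "first 23 bits" exactly as described above (truncation), whereas real hardware normally rounds the last bit.

```python
from fractions import Fraction


def encode_float_by_hand(value):
    """Follow steps A-C for a nonzero value; return the 32-bit pattern as an int."""
    x = Fraction(value)                 # exact value, so no hidden rounding
    if x == 0:
        raise ValueError("zero has no leading 1; outside this sketch")
    sign = 1 if x < 0 else 0            # step i: sign bit
    x = abs(x)
    exp = 0                             # step B: normalize to 1.xxx * 2^exp
    while x >= 2:
        x /= 2
        exp += 1
    while x < 1:
        x *= 2
        exp -= 1
    fraction = x - 1                    # step ii: drop the leading 1
    mantissa = int(fraction * 2**23)    # first 23 fraction bits (truncated)
    biased = exp + 127                  # step iii: excess-127
    if not 0 < biased < 255:
        raise ValueError("exponent outside the normal float range")
    return (sign << 31) | (biased << 23) | mantissa   # step iv: combine


print(hex(encode_float_by_hand(0.15625)))   # 0x3e200000
```

Here 0.15625 decimal is 0.00101 binary = 1.01 * 2^(-3), so the stored exponent is 127 + (-3) = 124.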
Example:
Find the representation of decimal -10.5 in float IEEE-754 representation.
A. Convert the number: -10.5 decimal --> -1010.1 binary
B. Move radix to get scientific notation: -1010.1 binary --> -1.0101 * 2^(+3) (binary)
C. Now, do the work:
i. Sign bit will be 1 since the number is negative.
ii. Mantissa. The significant bits of 1.0101, padded with zeros: (101010000...0)
Ignore leading one: 01010000...0
Use first 23 bits: 01010000000000000000000 --> This is the mantissa representation.
iii. Exponent is +3.
ADD excess-127 offset: 127 + (+3) = 130 decimal = 10000010 binary.
These are the 8 bits for the exponent: 10000010.
D. Combine in SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM format:
11000001001010000000000000000000
To make it easier to read, group in 4's and convert each group to hex:
original binary: 11000001001010000000000000000000
grouped binary:  1100 0001 0010 1000 0000 0000 0000 0000
hex:             C    1    2    8    0    0    0    0
or, C128 0000 hex
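The result can be checked against what the machine actually stores. A minimal check using Python's standard struct module, where the ">f" format packs a value as a big-endian 32-bit IEEE 754 float (the variable names raw and bits are just for illustration):

```python
import struct

raw = struct.pack(">f", -10.5)      # 4 bytes, high-order byte first
print(raw.hex())                    # c1280000

bits = format(int.from_bytes(raw, "big"), "032b")
print(bits[0], bits[1:9], bits[9:])  # S, then 8 E bits, then 23 M bits
# 1 10000010 01010000000000000000000
```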
II. Double. (64 bits)
The process to convert to "double" is similar to "float":
1. A total of 64 bits is used to represent the number.
2. The sign bit is still one bit, always the high-order bit.
3. The mantissa is stored with 52 bits instead of 23. This gives double its greater precision.
4. The exponent is stored with 11 bits instead of 8. This gives double the wider range of numbers it can represent. The exponent is stored excess-1023.
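Since float and double differ only in the field widths and the bias, one routine can re-create the value from either layout. The sketch below is illustrative only: decode_fields is a made-up name, it assumes a normal (nonzero, finite) number, and it computes the bias as 2^(exponent bits - 1) - 1, which gives the 127 and 1023 stated above.

```python
def decode_fields(word, exp_bits, man_bits):
    """Rebuild the value from a bit pattern, assuming a normal number."""
    bias = 2 ** (exp_bits - 1) - 1                # 127 for float, 1023 for double
    sign = word >> (exp_bits + man_bits)          # the high-order bit
    biased = (word >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = word & ((1 << man_bits) - 1)
    significand = 1 + mantissa / 2 ** man_bits    # re-insert the leading 1
    return (-1) ** sign * significand * 2.0 ** (biased - bias)


print(decode_fields(0xC1280000, 8, 23))            # float layout:  -10.5
print(decode_fields(0x4025000000000000, 11, 52))   # double layout: 10.5
```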
Example to try to work out: Decimal +10.5
--> 0100 0000 0010 0101 0000 ... 0000 binary (64 bits)
--> 4025 0000 0000 0000 hex
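To check a hand-worked double, the same struct approach works with the ">d" format (big-endian 64-bit IEEE 754 double). The helper name double_fields is hypothetical; it simply slices the 64 bits into 1 sign bit, 11 exponent bits and 52 mantissa bits.

```python
import struct


def double_fields(value):
    """Return (hex string, S bit, 11 E bits, 52 M bits) for a double."""
    raw = struct.pack(">d", value)
    bits = format(int.from_bytes(raw, "big"), "064b")
    return raw.hex(), bits[0], bits[1:12], bits[12:]


hex_str, s, e, m = double_fields(10.5)
print(hex_str)               # 4025000000000000
print(s, e, m[:4] + "...")   # 0 10000000010 0101...
```

The exponent field 10000000010 binary is 1026 decimal, which is 1023 + (+3).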