A floating-point number is a finite or infinite number that is representable in a floating-point format, i.e., a floating-point representation that is not a NaN.
In the IEEE 754-2008 standard, all floating-point numbers, including zeros and infinities, are signed.
IEEE 754-2008 defines five "basic formats" for floating-point numbers: three binary formats (32-, 64-, and 128-bit) and two decimal formats (64- and 128-bit). It also specifies several "recommended formats" that extend these basic formats to allow for even higher precision. All basic numerical formats are characterized by specifying a radix $b$, a precision $p$ (i.e., the number of digits in the significand), and an exponent range $e_{\min} \le e \le e_{\max}$ determined by the precision of the given format. In general, the finite nonzero floating-point numbers have the form

$$(-1)^s \times b^e \times m \tag{1}$$

where $s \in \{0, 1\}$ indicates the sign of the number, $e$ is its exponent, and $m$ is its significand. Note that the description in (1) is framed so that the significand $m$ is viewed in scientific form (with the radix point immediately following the first digit), though (1) may be re-expressed to view $m$ as an integer instead (whereby both $m$ and the exponent $e$ in (1) change accordingly).
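As an illustration of form (1), the following minimal sketch splits a Python float into a sign $s$, an exponent $e$, and a significand $m$, first in scientific form and then with an integer significand. It assumes Python floats are 64-bit binary (radix 2, precision 53), which holds on common platforms but is not guaranteed by the language. The helper names `decompose` and `decompose_integer` are hypothetical, chosen for this example only. The sketch always normalizes so that $1 \le m < 2$, so for very small inputs the returned exponent can fall below the format's minimum exponent.

```python
import math

P = 53  # precision of 64-bit binary, the format assumed for Python floats here


def decompose(x):
    """Return (s, e, m) with x == (-1)**s * 2**e * m and 1 <= m < 2."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("x must be finite and nonzero")
    s = 1 if x < 0 else 0
    frac, exp = math.frexp(abs(x))  # abs(x) == frac * 2**exp with 0.5 <= frac < 1
    return s, exp - 1, frac * 2.0   # rescale so the radix point follows the first digit


def decompose_integer(x):
    """Return (s, q, c) with x == (-1)**s * 2**q * c and c an integer below 2**P."""
    s, e, m = decompose(x)
    c = int(m * 2 ** (P - 1))       # exact: m has at most P significant bits
    return s, e - (P - 1), c


print(decompose(-6.5))              # (1, 2, 1.625), i.e. -6.5 == -(2**2) * 1.625
s, q, c = decompose_integer(-6.5)
print(c * 2.0 ** q == 6.5)          # True
```

Moving from the scientific view to the integer view multiplies the significand by $b^{p-1}$ and lowers the exponent by $p - 1$, which is the change of format that the text describes.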
| | 32-bit binary | 64-bit binary | 128-bit binary | 64-bit decimal | 128-bit decimal |
|---|---|---|---|---|---|
| digits of precision $p$ | 24 | 53 | 113 | 16 | 34 |
| $e_{\max}$ | +127 | +1023 | +16383 | +384 | +6144 |
The above table summarizes the characteristics of the five basic number formats. Note that $e_{\min} = 1 - e_{\max}$ by definition.
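The relation between the minimum and maximum exponents can be checked mechanically. The sketch below copies the table's parameters into a dictionary (the name `BASIC_FORMATS` and the helper `emin` are hypothetical, introduced only for this example), derives the minimum exponent of each format as one minus its maximum exponent, and cross-checks the 64-bit binary row against Python's `sys.float_info`, assuming the interpreter's float type is 64-bit binary.

```python
import sys

# (radix b, precision p, emax) for the five basic formats, copied from the table.
BASIC_FORMATS = {
    "32-bit binary": (2, 24, 127),
    "64-bit binary": (2, 53, 1023),
    "128-bit binary": (2, 113, 16383),
    "64-bit decimal": (10, 16, 384),
    "128-bit decimal": (10, 34, 6144),
}


def emin(emax):
    """Minimum exponent implied by emin = 1 - emax."""
    return 1 - emax


for name, (b, p, emax) in BASIC_FORMATS.items():
    print(f"{name}: b={b}, p={p}, emin={emin(emax)}, emax={emax}")

# Cross-check against the interpreter's own float type.
# sys.float_info follows the C convention (significand in [1/b, 1)), so its
# exponent bounds are one larger than the scientific-form bounds used here.
fi = sys.float_info
if (fi.radix, fi.mant_dig) == (2, 53):
    assert fi.max_exp - 1 == BASIC_FORMATS["64-bit binary"][2]
    assert fi.min_exp - 1 == emin(1023)
```

The off-by-one adjustment on `max_exp` and `min_exp` is an instance of the re-expression mentioned earlier: moving the radix point changes the exponent bounds but not the set of representable numbers.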
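| | 32-bit binary | 64-bit binary | 128-bit binary | 64-bit decimal | 128-bit decimal |
|---|---|---|---|---|---|
| digits of precision $p$ | $\ge 32$ | $\ge 64$ | $\ge 128$ | $\ge 22$ | $\ge 40$ |
| $e_{\max}$ | $\ge 1023$ | $\ge 16383$ | $\ge 65535$ | $\ge 6144$ | $\ge 24576$ |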
As mentioned previously, IEEE 754 also provides a framework of recommended formats by which the five basic formats may be extended. The table above summarizes the parameters of these extended-format floating-point numbers. Note that all such formats, both basic and recommended, allow for the signed zeros $+0$ and $-0$, the signed infinities $+\infty$ and $-\infty$, and two NaNs (quiet and signaling).
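These special values can be observed directly in a language whose float type follows IEEE 754. The short sketch below uses Python floats, assumed here to be 64-bit binary, to show that the two zeros compare equal yet keep their signs, that the infinities bound every finite value, and that a NaN is unequal even to itself. Python does not readily expose the distinction between quiet and signaling NaNs, so only a quiet NaN appears.

```python
import math

pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)            # True: the two zeros compare equal
print(math.copysign(1.0, pos_zero))    # 1.0
print(math.copysign(1.0, neg_zero))    # -1.0: the sign of zero is retained

pos_inf, neg_inf = math.inf, -math.inf
print(neg_inf < -1e308 < 1e308 < pos_inf)  # True: infinities bound all finite values

nan = math.nan
print(nan == nan)                      # False: a NaN is unordered, even with itself
print(math.isnan(nan))                 # True: the reliable way to detect a NaN
```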
In the literature, a distinction is made between normal and subnormal floating-point numbers. In particular, the smallest positive normal floating-point number is $b^{e_{\min}}$ and the largest is $b^{e_{\max}} \times (b - b^{1-p})$; on the other hand, nonzero floating-point numbers having magnitude less than $b^{e_{\min}}$ may exist and are called subnormal. Subnormal numbers are characterized by the fact that they always have fewer than $p$ significant digits; moreover, every finite floating-point number is an integral multiple of the smallest subnormal magnitude $b^{e_{\min}} \times b^{1-p}$ (IEEE Computer Society 2008).
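The bounds above can be evaluated exactly for the 64-bit binary format, where the radix is 2, the precision is 53, and the maximum exponent is 1023. The following sketch uses Python's `fractions.Fraction` for exact rational arithmetic and compares the results with `sys.float_info`. It assumes Python floats are 64-bit binary and that the inputs are finite; the helper names `is_subnormal` and `is_multiple_of_smallest_subnormal` are hypothetical.

```python
import sys
from fractions import Fraction

b, p, emax = 2, 53, 1023          # 64-bit binary parameters
emin = 1 - emax

# Exact rational values of the three bounds.
smallest_normal = Fraction(b) ** emin
largest_normal = Fraction(b) ** emax * (b - Fraction(b) ** (1 - p))
smallest_subnormal = Fraction(b) ** emin * Fraction(b) ** (1 - p)


def is_subnormal(x):
    """True if x is nonzero with magnitude below the smallest normal number."""
    return x != 0.0 and abs(Fraction(x)) < smallest_normal


def is_multiple_of_smallest_subnormal(x):
    """True if the finite float x is an integer multiple of b**emin * b**(1 - p)."""
    return (Fraction(x) / smallest_subnormal).denominator == 1


# These hold only if the interpreter's float type is 64-bit binary.
assert float(smallest_normal) == sys.float_info.min
assert float(largest_normal) == sys.float_info.max

print(is_subnormal(float(smallest_subnormal)))   # True
print(is_subnormal(sys.float_info.min))          # False
print(all(is_multiple_of_smallest_subnormal(x)
          for x in (0.1, 1.0, -6.5, sys.float_info.max)))  # True
```

Working with exact fractions avoids any rounding in the check itself, so a failed assertion would point to a different underlying format rather than to arithmetic error in the sketch.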