A floating-point number is a finite or infinite number that is representable in a floating-point format, i.e., a floating-point representation that is not a NaN.
In the IEEE 754-2008 standard, all floating-point numbers, including zeros and infinities, are signed.
IEEE 754-2008 allows for five "basic formats" for floating-point numbers: three binary formats (32-, 64-, and 128-bit) and two decimal formats (64- and 128-bit). It also specifies several "recommended formats" that extend these basic formats to allow for even higher precision. All basic numerical formats are characterized by specifying a radix $b$, a precision $p$ (i.e., the number of digits in the significand), and an exponent range $e_{\min} \le e \le e_{\max}$ fixed by the given format. In general, the nonzero floating-point numbers have the form

$$(-1)^s \times b^e \times m, \tag{1}$$

where $s$ indicates the sign of the number, $e$ is its exponent, and $m$ is its significand. Note that the description in (1) is framed so that the significand $m$ is viewed in scientific form (with the period or radix point immediately following the first digit), though (1) may be re-expressed to view $m$ as an integer instead (whereby both $m$ and the exponent $e$ in (1) will change accordingly).
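As an illustration of form (1), the following sketch splits a Python `float` into a sign $s$, an exponent $e$, and a significand $m$ in scientific form, and then re-expresses the same value with an integer significand. It assumes that Python's `float` is the 64-bit binary format (radix 2, $p = 53$), which holds on common platforms but is not guaranteed by the language; the function names `decompose` and `integer_form` are made up for this example.

```python
import math

P = 53  # assumed precision of a Python float (64-bit binary format)

def decompose(x: float) -> tuple[int, int, float]:
    """Return (s, e, m) with x == (-1)**s * 2**e * m and 1 <= m < 2.

    Only defined here for finite, nonzero x. For subnormal x the returned
    exponent falls below the format's minimum exponent, because the
    significand is renormalized to have a leading 1.
    """
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("expected a finite, nonzero float")
    s = 1 if math.copysign(1.0, x) < 0 else 0
    # math.frexp gives abs(x) == f * 2**k with 0.5 <= f < 1.
    f, k = math.frexp(abs(x))
    # Move the radix point so that exactly one digit precedes it.
    return s, k - 1, f * 2.0

def integer_form(x: float) -> tuple[int, int, int]:
    """Return (s, q, M) with x == (-1)**s * 2**q * M and M an integer."""
    s, e, m = decompose(x)
    # Scaling m by 2**(P - 1) is exact and yields an integer significand.
    return s, e - (P - 1), int(m * 2 ** (P - 1))

print(decompose(-6.5))
print(integer_form(-6.5))
```

Both tuples describe the same number; only the position of the radix point in the significand, and hence the exponent, differs.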
parameter | 32-bit binary | 64-bit binary | 128-bit binary | 64-bit decimal | 128-bit decimal |
precision $p$ (digits) | 24 | 53 | 113 | 16 | 34 |
$e_{\max}$ | +127 | +1023 | +16383 | +384 | +6144 |
The above table summarizes the characteristics of the five basic number formats. Note that the minimum exponent satisfies $e_{\min} = 1 - e_{\max}$ by definition.
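The 64-bit binary column of the table can be checked against a running Python interpreter. This is a sketch under the assumption that the platform's `float` is the 64-bit binary format. Note that `sys.float_info.max_exp` and `sys.float_info.min_exp` follow the C convention, in which the significand lies in $[1/b, 1)$, so each is one larger than the corresponding exponent in the table. The dictionary `BASIC_FORMATS` is a name introduced here for illustration.

```python
import sys

# (radix b, precision p, emax) for the five basic formats, from the table.
BASIC_FORMATS = {
    "32-bit binary": (2, 24, 127),
    "64-bit binary": (2, 53, 1023),
    "128-bit binary": (2, 113, 16383),
    "64-bit decimal": (10, 16, 384),
    "128-bit decimal": (10, 34, 6144),
}

def emin(name: str) -> int:
    """Minimum exponent of a basic format, using emin = 1 - emax."""
    _, _, emax = BASIC_FORMATS[name]
    return 1 - emax

info = sys.float_info
# Convert from the C convention (significand in [1/2, 1)) to the
# scientific-form convention used in the table (significand in [1, 2)).
observed = (info.radix, info.mant_dig, info.max_exp - 1)

print(observed == BASIC_FORMATS["64-bit binary"])
print(info.min_exp - 1 == emin("64-bit binary"))
```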
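parameter | 32-bit binary | 64-bit binary | 128-bit binary | 64-bit decimal | 128-bit decimal |
precision $p$ (digits) | $\ge 32$ | $\ge 64$ | $\ge 128$ | $\ge 22$ | $\ge 40$ |
$e_{\max}$ | $\ge 1023$ | $\ge 16383$ | $\ge 65535$ | $\ge 6144$ | $\ge 24576$ |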
As mentioned previously, IEEE 754 also provides a framework of recommended formats by which the five basic formats may be extended. The table above summarizes the parameters of these extended-format floating-point numbers, each column giving the minimum requirements for an extension of the corresponding basic format. Note that all such formats, both basic and recommended, allow for $+0$ and $-0$, $\pm\infty$, and two NaNs (a quiet NaN and a signaling NaN).
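The special values can be observed directly in Python, again assuming its `float` is an IEEE 754 binary format. The sketch below only shows the signed zeros, the two infinities, and a quiet NaN; Python's `float` type does not offer a portable way to construct a signaling NaN, so that case is left out.

```python
import math

pos_zero, neg_zero = 0.0, -0.0

# The two zeros compare equal, but the sign is still stored and can be
# recovered with copysign.
zeros_equal = pos_zero == neg_zero
neg_zero_sign = math.copysign(1.0, neg_zero)

# Infinities are signed floating-point numbers that order as expected.
inf_ordering = -math.inf < 0.0 < math.inf

# A NaN is not a floating-point number and compares unequal even to itself.
nan_self_equal = math.nan == math.nan

print(zeros_equal, neg_zero_sign, inf_ordering, nan_self_equal)
```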
In the literature, a distinction is made between normal and subnormal floating-point numbers. In particular, the smallest positive normal floating-point number is $b^{e_{\min}}$ and the largest is $b^{e_{\max}}\,(b - b^{1-p})$; on the other hand, nonzero floating-point numbers having magnitude less than $b^{e_{\min}}$ may exist and are called subnormal. Subnormal numbers are characterized by the fact that they always have fewer than $p$ significant digits; moreover, every finite floating-point number is an integral multiple of the smallest subnormal magnitude

$$b^{e_{\min} + 1 - p} \tag{2}$$

(IEEE Computer Society 2008).
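For the 64-bit binary format these bounds can be evaluated numerically. The following is a minimal sketch, assuming Python's `float` is that format ($b = 2$, $p = 53$, $e_{\max} = 1023$); the helper name `is_multiple_of_smallest_subnormal` is introduced only for this example. Exact rational arithmetic from the `fractions` module is used for the divisibility check so that no rounding enters the test itself.

```python
import sys
from fractions import Fraction

P = sys.float_info.mant_dig        # precision p
EMAX = sys.float_info.max_exp - 1  # emax in scientific-form convention
EMIN = 1 - EMAX                    # emin = 1 - emax

smallest_normal = 2.0 ** EMIN
largest_finite = (2.0 - 2.0 ** (1 - P)) * 2.0 ** EMAX
smallest_subnormal = 2.0 ** (EMIN + 1 - P)

def is_multiple_of_smallest_subnormal(x: float) -> bool:
    """Check exactly whether finite x is an integer multiple of 2**(EMIN+1-P)."""
    unit = Fraction(2) ** (EMIN + 1 - P)
    return (Fraction(x) / unit).denominator == 1

print(smallest_normal == sys.float_info.min)
print(largest_finite == sys.float_info.max)
print(all(is_multiple_of_smallest_subnormal(x)
          for x in (0.1, 1.0, 3e-310, largest_finite)))
```

The value `3e-310` in the last check lies below `smallest_normal`, so it is stored as a subnormal number.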