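A floating-point representation encodes rational numbers of the form $V = x \times 2^y$.
Fractional binary notation $b_m b_{m-1} \cdots b_1 b_0 . b_{-1} b_{-2} \cdots b_{-n}$ represents a number $b$ defined as $b = \sum_{i=-n}^{m} 2^i \times b_i$.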
Here are some practices:
The IEEE floating-point standard represents a number in the form $V = (-1)^s \times M \times 2^E$:
- The sign s determines whether the number is negative (s = 1) or positive (s = 0).
- The exponent E weights the value by a power of 2.
- The significand M is a fractional binary number.
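When the exp field is neither all zeros nor all ones, the represented number is in normalized form. In this case, the exp field is interpreted as a signed integer in biased form. That is, the exponent value is $E = e - Bias$, where $e$ is the unsigned value of the exp field and $Bias = 2^{k-1} - 1$ for a $k$-bit exp field (127 for single precision, 1023 for double precision).
The significand is $M = 1 + f$, where $f$ is the value of the fraction field read as a binary fraction with $0 \le f < 1$.
When the exp field is all zeros, the represented number is in denormalized form. In this case, the exponent value is $E = 1 - Bias$ and the significand is $M = f$, without an implied leading 1.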
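Denormalized numbers serve two purposes. First, they provide a way to represent the numeric value 0, since with a normalized number we must always have $M \ge 1$. Second, they represent numbers that are very close to 0, so that the representable values approach zero gradually and with even spacing.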
A final category of values occurs when the exp field is all ones.
- When the fraction field is all zeros, the resulting values represent infinity: positive infinity when s = 0 and negative infinity when s = 1. Infinity can represent results that overflow, as when we multiply two very large numbers or divide by zero.
- When the fraction field is nonzero, the resulting value is called NaN, short for "not a number". Such values are returned when the result of an operation cannot be given as a real number or as infinity, as when computing $\sqrt{-1}$.
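Floating-point arithmetic can only approximate real arithmetic, since the representation has limited range and precision. Thus, for a value $x$, we generally want a systematic way of finding the closest matching value $x'$ that can be represented in the desired floating-point format. This is the task of the rounding operation.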
Here are four rounding modes:
Mode | 1.4 | 1.6 | 1.5 | 2.5 | -1.5 |
---|---|---|---|---|---|
Round-to-even | 1 | 2 | 2 | 2 | -2 |
Round-toward-zero | 1 | 1 | 1 | 2 | -1 |
Round-down | 1 | 1 | 1 | 2 | -2 |
Round-up | 2 | 2 | 2 | 3 | -1 |
Some practices:
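Because of rounding, floating-point operations:
- are not associative: (a + b) + c may not equal a + (b + c).
- are monotonic: if a >= b, then x + a >= x + b for any values of a, b, and x other than NaN.
- are not distributive: a * (b + c) may not equal a * b + a * c.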
Casting values between `int`, `float`, and `double` behaves as follows:
- From `int` to `float`, the number cannot overflow, but it may be rounded.
- From `int` or `float` to `double`, the exact numeric value is preserved.
- From `double` to `float`, the value can overflow to positive or negative infinity, or it may be rounded.
- From `float` or `double` to `int`, the value is rounded toward zero, and it may overflow.
Some practices: