Let
e be the exponent with E bits,
m be the mantissa with M bits,
N = floor(log_2(m))
r = m / 2^N
R = M - N.
For now, suppose that R < B - 1
.
A floating-point number is normal, if e > -(E / 2) + 1
. The value of a normal number is given by
x = (-1)^s 2^e (1 + m / 2^M)
= (-1)^s 2^(e - M) (2^M + m)
= (-1)^s 2^(e - R) (2^R + r)
A floating-point number overflows a rational number, if
|x| >= 2^(B - 1)
<=> 2^(e - R) (2^R + r) >= 2^(B - 1)
<== 2^(e - R) (2^R + 0) >= 2^(B - 1)
<=> 2^e >= 2^(B - 1)
<=> e >= B - 1.
We will now show that the implication <== can be replaced with equivalence. Suppose e = B - 2
. Then
|x| >= 2^(B - 1)
<=> 2^(B - 2 - R) (2^R + r) >= 2^(B - 1)
<=> 2^R + r >= 2^(R + 1)
<=> r >= 2^R
<=> false.
So that exponents less than B - 1
never overflow. We have that
A normal number overflows a rational number if and only if `e >= B - 1`.
In this case the rational number is assigned [(-1)^s, 0]
, which is infinity or minus-infinity.
A rational number with a B-bit denominator has at most 2^(-(B - 1)) bit precision. Therefore, a floating-point number underflows a rational number, if
|x| < 2^(-(B - 1))
<=> 2^(e - R) (2^R + r) < 2^(-(B - 1))
<== 2^(e - R) (2^R + 2^R) < 2^(-(B - 1))
<=> 2^(e + 1) < 2^(-B + 1)
<=> e + 1 < -B + 1
<=> e < -B.
We will now show that the implication <== can be replaced with equivalence. Suppose e = -(B - 1)
. Then
|x| < 2^(-(B - 1))
<=> 2^(-(B - 1) - R) (2^R + r) < 2^(-(B - 1))
<=> 2^R + r < 2^R
<=> r < 0
<=> false.
So that exponents greater than -B
never underflow. We have that
A normal number underflows a rational number if and only if `e < -B`.
In this case the rational number is assigned [0, 1]
, which is zero.
A floating-point number is an integer, if e >= R
. This fits exactly into a rational number, if it does not overflow the numerator. Therefore, we have
A normal number is a rational number `[(-1)^s 2^(e - R) (2^R + r), 1]` if and only if `R <= e < B - 1`.
A floating-point number with -B <= e < R
overflows the denumerator of the rational number [(-1)^s (2^R + r), 2^(R - e)]
, if
2^(R - e) >= 2^(B - 1)
<=> R - e >= B - 1
<=> R - (B - 1) >= e.
The numerator overflows, if
2^R + r >= 2^(B - 1)
<== 2^R >= 2^(B - 1)
<=> R >= B - 1.
We have that
A normal number is a rational number `[(-1)^s (2^R + r), 2^(R - e)]` if and only if `-B <= e <= R - (B - 1)`.
Suppose R - (B - 1) < e < R
. Then [(-1)^s (2^R + r), 2^(R - e)]
is not a valid rational number, since the denumerator overflows. There may exist a rational number with an exact solution. However, sometimes there does not. In general, we would like to find the best rational approximation assuming a denominator at most 2^(B - 1)
.
The best rational approximation can be found by searching through the semi-convergents of (the continued fraction of) x
. However, I am worried about the numerical stability of computing the semi-convergents, which must be done using floating-point arithmetic. Note that the convergents of x
do not produce all of the best rational approximations.
An alternative is to shift the numerator and the denumerator by equal amounts until the numerator becomes at most 2^(B - 1)
. This is easy, but is not a best rational approximation.