Rational number from an IEEE floating point number

Let

e be the exponent with E bits,
m be the mantissa with M bits,
N = floor(log_2(m))
r = m / 2^N
R = M - N.

For now, suppose that R < B - 1.

Normal numbers

A floating-point number is normal, if e > -(E / 2) + 1. The value of a normal number is given by

x = (-1)^s 2^e (1 + m / 2^M)
  = (-1)^s 2^(e - M) (2^M + m)
  = (-1)^s 2^(e - R) (2^R + r)

Normal number overflow

A floating-point number overflows a rational number, if

    |x| >= 2^(B - 1)
<=> 2^(e - R) (2^R + r) >= 2^(B - 1)
<== 2^(e - R) (2^R + 0) >= 2^(B - 1)
<=> 2^e >= 2^(B - 1)
<=> e >= B - 1.

We will now show that the implication <== can be replaced with equivalence. Suppose e = B - 2. Then

    |x| >= 2^(B - 1)
<=> 2^(B - 2 - R) (2^R + r) >= 2^(B - 1)
<=> 2^R + r >= 2^(R + 1)
<=> r >= 2^R
<=> false.

So that exponents less than B - 1 never overflow. We have that

A normal number overflows a rational number if and only if `e >= B - 1`.

In this case the rational number is assigned [(-1)^s, 0], which is infinity or minus-infinity.

Normal number underflow

A rational number with a B-bit denominator has at most 2^(-(B - 1)) bit precision. Therefore, a floating-point number underflows a rational number, if

    |x| < 2^(-(B - 1))
<=> 2^(e - R) (2^R + r) < 2^(-(B - 1))
<== 2^(e - R) (2^R + 2^R) < 2^(-(B - 1))
<=> 2^(e + 1) < 2^(-B + 1)
<=> e + 1 < -B + 1
<=> e < -B.

We will now show that the implication <== can be replaced with equivalence. Suppose e = -(B - 1). Then

    |x| < 2^(-(B - 1))
<=> 2^(-(B - 1) - R) (2^R + r) < 2^(-(B - 1))
<=> 2^R + r < 2^R
<=> r < 0
<=> false.

So that exponents greater than -B never underflow. We have that

A normal number underflows a rational number if and only if `e < -B`.

In this case the rational number is assigned [0, 1], which is zero.

Normal number integers

A floating-point number is an integer, if e >= R. This fits exactly into a rational number, if it does not overflow the numerator. Therefore, we have

A normal number is a rational number `[(-1)^s 2^(e - R) (2^R + r), 1]` if and only if `R <= e < B - 1`.

Normal number rationals

A floating-point number with -B <= e < R overflows the denumerator of the rational number [(-1)^s (2^R + r), 2^(R - e)], if

    2^(R - e) >= 2^(B - 1)
<=> R - e >= B - 1
<=> R - (B - 1) >= e.

The numerator overflows, if

    2^R + r >= 2^(B - 1)
<== 2^R >= 2^(B - 1)
<=> R >= B - 1.

We have that

A normal number is a rational number `[(-1)^s (2^R + r), 2^(R - e)]` if and only if `-B <= e <= R - (B - 1)`.

Normal number inexact rationals

Suppose R - (B - 1) < e < R. Then [(-1)^s (2^R + r), 2^(R - e)] is not a valid rational number, since the denumerator overflows. There may exist a rational number with an exact solution. However, sometimes there does not. In general, we would like to find the best rational approximation assuming a denominator at most 2^(B - 1).

The best rational approximation can be found by searching through the semi-convergents of (the continued fraction of) x. However, I am worried about the numerical stability of computing the semi-convergents, which must be done using floating-point arithmetic. Note that the convergents of x do not produce all of the best rational approximations.

An alternative is to shift the numerator and the denumerator by equal amounts until the numerator becomes at most 2^(B - 1). This is easy, but is not a best rational approximation.

Files

?

rational_as_real.hpp