04.03.2009

This paper discusses the scalar quantization of a real number range [-1, 1] to a p-bit signed integer range [-2^(p - 1), 2^(p - 1) - 1]. Our approach is to list a set of requirements for the quantization, and then find conversion functions that fulfill these requirements. The result is that the conversion functions in both directions are given by:

```
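typedef double real;
typedef int integer;

// Minimal stand-in for the clamp helper that the functions below assume.
inline real clamp(real x, real lo, real hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

// Maps a P-bit signed integer to a real number in [-1, 1].
template <int P>
real dequantizeSigned(integer i)
{
    enum
    {
        N = (1 << (P - 1)) - 1 // largest P-bit signed integer
    };
    return clamp(i / ((real)N + 0.5), -1, 1);
}

// Maps a real number in [-1, 1] to an integer in [-N, N].
template <int P>
integer quantizeSigned(real x)
{
    enum
    {
        N = (1 << (P - 1)) - 1
    };
    // Offsetting by 0.5 away from zero makes the truncating cast round to nearest.
    const real sign = (x >= 0) ? 0.5 : -0.5;
    return (integer)clamp(x * ((real)N + 0.5) + sign, -N, N);
}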
```

Let f = dequantizeSigned, g = quantizeSigned, I = [-(2^(p - 1) - 1), 2^(p - 1) - 1] subset Z, and X = [-1, 1] subset R. The conversion functions f and g enjoy the following properties:

- for all i in I: g(f(i)) = i
- The supremum error between f(g(x)) and x is minimal.
- g maps a uniform random distribution on X to a uniform random distribution on I.
- for all i in I: f(-i) = -f(i) (in particular, f(0) = 0).
- for all x in X: g(-x) = -g(x) (in particular, g(0) = 0).
- g(-1) = -(2^(p - 1) - 1)
- g(1) = 2^(p - 1) - 1

To achieve these properties, the minimum p-bit signed integer value -2^(p - 1) is left outside I: f maps it to -1, so the round trip g(f(-2^(p - 1))) yields -(2^(p - 1) - 1).
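The properties above that concern individual values can be spot-checked mechanically. The following is a minimal sketch, not part of the paper: `checkProperties` is a hypothetical helper, `clamp` is a stand-in for whatever clamp the original code uses, and the constant N is written as a local `const` instead of an `enum` for brevity. It does not attempt to check the statements about the supremum error or the uniform distribution.

```cpp
#include <cassert>

typedef double real;
typedef int integer;

// Stand-in for the clamp helper assumed by the conversion functions.
inline real clamp(real x, real lo, real hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

template <int P>
real dequantizeSigned(integer i)
{
    const integer N = (1 << (P - 1)) - 1;
    return clamp(i / ((real)N + 0.5), -1, 1);
}

template <int P>
integer quantizeSigned(real x)
{
    const integer N = (1 << (P - 1)) - 1;
    const real sign = (x >= 0) ? 0.5 : -0.5;
    return (integer)clamp(x * ((real)N + 0.5) + sign, -N, N);
}

// Hypothetical helper: asserts the listed properties for every i in I.
template <int P>
void checkProperties()
{
    const integer N = (1 << (P - 1)) - 1;
    for (integer i = -N; i <= N; ++i)
    {
        const real x = dequantizeSigned<P>(i);
        assert(quantizeSigned<P>(x) == i);     // g(f(i)) = i
        assert(dequantizeSigned<P>(-i) == -x); // f(-i) = -f(i)
        assert(quantizeSigned<P>(-x) == -i);   // g(-x) = -g(x), sampled at x = f(i)
    }
    assert(quantizeSigned<P>(-1.0) == -N);     // g(-1) = -(2^(P - 1) - 1)
    assert(quantizeSigned<P>(1.0) == N);       // g(1) = 2^(P - 1) - 1
    // The minimum P-bit integer dequantizes to -1 and comes back as -N.
    assert(dequantizeSigned<P>(-N - 1) == -1.0);
    assert(quantizeSigned<P>(dequantizeSigned<P>(-N - 1)) == -N);
}
```

Calling `checkProperties<8>()` from a `main` function exercises all 255 values of I for p = 8. The assertions are active only when `NDEBUG` is not defined.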

Below is a program that verifies the formulas given in the paper. The test converts each p-bit integer to a real and back to an integer, and then checks that the result equals the original. With real arithmetic there should be no errors (except the intentional conversion of -2^(p - 1) to -(2^(p - 1) - 1), which we don't count). However, floating-point round-off could introduce conversion errors. Using double precision, there are no errors when p is in the range [2, 32]. Using single precision, there are no errors when p is in the range [2, 25], but beyond that single precision produces many errors.
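As an illustrative sketch of this kind of round-trip test (not the paper's own program), the functions can be parameterized on the floating-point type. Several things here are assumptions made for illustration: the names `clampTo` and `countRoundTripErrors`, the use of a run-time bit count `p` instead of the template parameter, and the 64-bit integer type, chosen so that p = 32 does not overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical clamp helper, generic over the floating-point type under test.
template <typename Real>
Real clampTo(Real x, Real lo, Real hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

// Same formula as dequantizeSigned, with all arithmetic done in Real.
template <typename Real>
Real dequantizeSigned(std::int64_t i, int p)
{
    const Real n = Real((std::int64_t(1) << (p - 1)) - 1);
    return clampTo(Real(i) / (n + Real(0.5)), Real(-1), Real(1));
}

// Same formula as quantizeSigned, with all arithmetic done in Real.
template <typename Real>
std::int64_t quantizeSigned(Real x, int p)
{
    const Real n = Real((std::int64_t(1) << (p - 1)) - 1);
    const Real sign = (x >= 0) ? Real(0.5) : Real(-0.5);
    return (std::int64_t)clampTo(x * (n + Real(0.5)) + sign, -n, n);
}

// Counts the integers i in [-N, N] whose round trip does not return i.
// The minimum value -2^(p - 1) is skipped because its change is intentional.
template <typename Real>
std::int64_t countRoundTripErrors(int p)
{
    assert(p >= 2 && p <= 32);
    const std::int64_t n = (std::int64_t(1) << (p - 1)) - 1;
    std::int64_t errors = 0;
    for (std::int64_t i = -n; i <= n; ++i)
    {
        if (quantizeSigned<Real>(dequantizeSigned<Real>(i, p), p) != i)
        {
            ++errors;
        }
    }
    return errors;
}
```

Comparing `countRoundTripErrors<float>(p)` with `countRoundTripErrors<double>(p)` for increasing p is one way to look for the precision limits described above. The loop visits 2^p - 1 values, so large p takes correspondingly long, and the counts for `float` may depend on whether the compiler evaluates intermediate results in a wider type.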