# Scalar quantization to a signed integer

04.03.2009

## Abstract

This paper discusses the scalar quantization of the real number range [-1, 1] to the p-bit signed integer range [-2^(p - 1), 2^(p - 1) - 1]. Our approach is to list a set of requirements for the quantization and then find conversion functions that fulfill them. The resulting conversion functions, one for each direction, are:

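```cpp
#include <algorithm>  // std::clamp (C++17)

typedef double real;
typedef int integer;

// Maps a P-bit signed integer to a real number in [-1, 1].
template <int P>
real dequantizeSigned(integer i)
{
    enum
    {
        // Largest P-bit signed integer, 2^(P - 1) - 1.
        N = (1 << (P - 1)) - 1
    };

    return std::clamp<real>(i / ((real)N + 0.5), -1, 1);
}

// Maps a real number in [-1, 1] to an integer in [-N, N].
template <int P>
integer quantizeSigned(real x)
{
    enum
    {
        N = (1 << (P - 1)) - 1
    };

    // Adding +-0.5 before the truncating cast rounds to the
    // nearest integer, symmetrically around zero.
    const real sign = (x >= 0) ? 0.5 : -0.5;

    return (integer)std::clamp<real>(x * ((real)N + 0.5) + sign, -N, N);
}
```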
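For readers who prefer formulas, the listing above can be transcribed as follows. This is only a restatement of the code, not an additional result; the symbols trunc (rounding toward zero, as a C++ integer cast does) and s(x) (equal to 1 for x >= 0 and -1 otherwise) are notation introduced here for illustration.

```latex
\begin{align*}
N    &= 2^{p-1} - 1, \\
f(i) &= \operatorname{clamp}\!\left(\frac{i}{N + \tfrac{1}{2}},\ -1,\ 1\right), \\
g(x) &= \operatorname{trunc}\!\left(\operatorname{clamp}\!\left(x \left(N + \tfrac{1}{2}\right) + \tfrac{1}{2}\, s(x),\ -N,\ N\right)\right).
\end{align*}
% Worked example for p = 8, so N = 127:
%   f(127) = 127 / 127.5 \approx 0.99608
%   g(0.5) = trunc(0.5 * 127.5 + 0.5) = trunc(64.25) = 64
%   g(1)   = trunc(clamp(128, -127, 127)) = 127
```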

Let f = dequantizeSigned, g = quantizeSigned, I = [-(2^(p - 1) - 1), 2^(p - 1) - 1] subset Z, and X = [-1, 1] subset R. The conversion functions f and g enjoy the following properties:

• for all i in I: g(f(i)) = i
• The supremum of |f(g(x)) - x| over x in X is minimal; for the functions above it equals 1 / (2^p - 1).
• g maps a uniform random distribution on X to a uniform random distribution on I.
• for all i in I: f(-i) = -f(i) (in particular, f(0) = 0).
• for all x in X: g(-x) = -g(x) (in particular, g(0) = 0).
• g(-1) = -(2^(p - 1) - 1)
• g(1) = 2^(p - 1) - 1

To achieve these properties, the minimum p-bit signed integer value -2^(p - 1) is left out of I: g never produces it, and f maps it to -1, so that g(f(-2^(p - 1))) = -(2^(p - 1) - 1).
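Several of the listed properties can be spot-checked numerically. The following is a minimal sketch, assuming a C++17 compiler and using `std::clamp` in place of the unqualified `clamp` of the listing; the helpers `maxLevel`, `roundTripIsIdentity`, and `maxReconstructionError` are hypothetical names introduced here and are not part of the original code.

```cpp
#include <algorithm>  // std::clamp, std::max (C++17)
#include <cmath>      // std::fabs

typedef double real;
typedef int integer;

// Hypothetical helper: N = 2^(P - 1) - 1, the largest quantized magnitude.
template <int P>
constexpr integer maxLevel()
{
    return (1 << (P - 1)) - 1;
}

template <int P>
real dequantizeSigned(integer i)
{
    const real scale = maxLevel<P>() + 0.5;
    return std::clamp<real>(i / scale, -1, 1);
}

template <int P>
integer quantizeSigned(real x)
{
    const integer n = maxLevel<P>();
    const real sign = (x >= 0) ? 0.5 : -0.5;
    return (integer)std::clamp<real>(x * (n + 0.5) + sign, -n, n);
}

// Hypothetical helper: checks g(f(i)) = i and f(-i) = -f(i)
// for every i in I = [-N, N].
template <int P>
bool roundTripIsIdentity()
{
    const integer n = maxLevel<P>();
    for (integer i = -n; i <= n; ++i)
    {
        if (quantizeSigned<P>(dequantizeSigned<P>(i)) != i)
        {
            return false;
        }
        if (dequantizeSigned<P>(-i) != -dequantizeSigned<P>(i))
        {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: largest |f(g(x)) - x| over a uniform grid
// of samples + 1 points on [-1, 1]. This only estimates the supremum.
template <int P>
real maxReconstructionError(int samples)
{
    real worst = 0;
    for (int k = 0; k <= samples; ++k)
    {
        const real x = -1 + 2.0 * k / samples;
        const real y = dequantizeSigned<P>(quantizeSigned<P>(x));
        worst = std::max(worst, std::fabs(y - x));
    }
    return worst;
}
```

With P = 8, `roundTripIsIdentity<8>()` should hold, `quantizeSigned<8>(1.0)` and `quantizeSigned<8>(-1.0)` should give 127 and -127, and `quantizeSigned<8>(dequantizeSigned<8>(-128))` should give -127, which shows the minimum 8-bit value collapsing onto -(2^(p - 1) - 1) after one round trip.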