Scalar quantization to a signed integer

04.03.2009

Abstract

This paper discusses the scalar quantization of a real number range [-1, 1] to a p-bit signed integer range [-2^(p - 1), 2^(p - 1) - 1]. Our approach is to list a set of requirements for the quantization, and then find conversion functions that fulfill these requirements. The result is that the conversion functions in both directions are given by:

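typedef double real;
typedef int integer;

// Restricts x to the closed range [min, max].
inline real clamp(real x, real min, real max)
{
    return (x < min) ? min : ((x > max) ? max : x);
}

// Converts a P-bit signed integer to a real number in [-1, 1].
template <int P>
real dequantizeSigned(integer i)
{
    enum
    {
        // The largest P-bit signed integer, 2^(P - 1) - 1.
        N = (1 << (P - 1)) - 1
    };

    return clamp(i / ((real)N + 0.5), -1, 1);
}

// Converts a real number in [-1, 1] to an integer in [-N, N].
template <int P>
integer quantizeSigned(real x)
{
    enum
    {
        N = (1 << (P - 1)) - 1
    };

    // Adding +-0.5 before the truncating cast rounds to the nearest integer.
    const real sign = (x >= 0) ? 0.5 : -0.5;

    return (integer)clamp(x * ((real)N + 0.5) + sign, -N, N);
}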

Let f = dequantizeSigned, g = quantizeSigned, I = [-(2^(p - 1) - 1), 2^(p - 1) - 1] subset Z, X = [-1, 1] subset R. The conversion functions f and g enjoy the following properties:

To achieve these properties, the minimum p-bit signed integer value -2^(p - 1) is not preserved by the round trip: f clamps it to -1, which g maps back to -(2^(p - 1) - 1).
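As a concrete illustration, the case p = 8 (where N = 127) can be checked directly. The following is a minimal sketch that repeats the two conversion functions from above in compact form. The clamp helper and the function checkEightBitExample are hypothetical names introduced here for illustration, since the page does not show a clamp definition.

```cpp
#include <cassert>

typedef double real;
typedef int integer;

// Hypothetical helper: the page uses clamp() without showing its definition.
static real clamp(real x, real lo, real hi)
{
    return (x < lo) ? lo : ((x > hi) ? hi : x);
}

template <int P>
real dequantizeSigned(integer i)
{
    enum { N = (1 << (P - 1)) - 1 };
    return clamp(i / ((real)N + 0.5), -1, 1);
}

template <int P>
integer quantizeSigned(real x)
{
    enum { N = (1 << (P - 1)) - 1 };
    const real sign = (x >= 0) ? 0.5 : -0.5;
    return (integer)clamp(x * ((real)N + 0.5) + sign, -N, N);
}

// Checks a few values for p = 8, where N = 127.
void checkEightBitExample()
{
    // The minimum value -128 is clamped to -1 and comes back as -127.
    assert(dequantizeSigned<8>(-128) == -1.0);
    assert(quantizeSigned<8>(dequantizeSigned<8>(-128)) == -127);

    // Both ends of [-1, 1] map to the ends of the symmetric integer range.
    assert(quantizeSigned<8>(-1.0) == -127);
    assert(quantizeSigned<8>(1.0) == 127);

    // Zero is preserved in both directions.
    assert(dequantizeSigned<8>(0) == 0.0);
    assert(quantizeSigned<8>(0.0) == 0);
}
```

Calling checkEightBitExample() from a test program should pass silently when assertions are enabled.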

Paper

Download the paper (PDF)

Brute force test

Below is a program that verifies the formulas given in the paper. The test converts an integer to a real and back to an integer, and then checks that the result equals the original. This is done for every p-bit integer. With real arithmetic, there should be no errors (except the intentional conversion of -2^(p - 1) to -(2^(p - 1) - 1), which we don’t count). However, conversion errors could result from round-off when using floating-point arithmetic. Using double precision, there are no errors when p is in the range [2, 32]. Using single precision, there are no errors when p is in the range [2, 25]; beyond that, single precision produces many errors.
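The round-trip test described above might look roughly like the following. This is a sketch under stated assumptions, not the linked test program: the names clampTo and countRoundTripErrors are hypothetical, p is passed at run time instead of as a template parameter, and 64-bit integers are used so that p = 32 does not overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: restricts x to the closed range [lo, hi].
template <typename Real>
Real clampTo(Real x, Real lo, Real hi)
{
    return (x < lo) ? lo : ((x > hi) ? hi : x);
}

// Counts the integers in [-n, n], n = 2^(p - 1) - 1, that fail the
// round trip integer -> real -> integer for the floating-point type Real.
template <typename Real>
std::int64_t countRoundTripErrors(int p)
{
    assert(p >= 2 && p <= 32);

    // 64-bit arithmetic (an assumption of this sketch) avoids overflow.
    const std::int64_t n = (std::int64_t(1) << (p - 1)) - 1;
    const Real scale = Real(n) + Real(0.5);

    std::int64_t errors = 0;

    // The minimum value -2^(p - 1) is skipped: it intentionally maps to -n.
    for (std::int64_t i = -n; i <= n; ++i)
    {
        // Integer to real (dequantization).
        const Real x = clampTo(Real(i) / scale, Real(-1), Real(1));

        // Real back to integer (quantization).
        const Real sign = (x >= 0) ? Real(0.5) : Real(-0.5);
        const std::int64_t j =
            (std::int64_t)clampTo(x * scale + sign, Real(-n), Real(n));

        if (j != i)
        {
            ++errors;
        }
    }

    return errors;
}
```

For example, countRoundTripErrors<double>(16) checks all 16-bit values with double precision, and countRoundTripErrors<float>(p) does the same with single precision. Note that the loop runs over about 2^p values, so large p takes correspondingly long.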

Files

Brute-force test for range conversion