Quantization and overflow handling

Quantization modes

The following figure illustrates the effect of the different quantization modes when a fixed-point number with three fractional bits is quantized to zero fractional bits. Each dot corresponds to a value: red dots mark numbers that were already integers, and yellow dots mark numbers that lie exactly halfway between two integers (ties). Below each plot is a histogram of the error distribution, where the red line indicates the bias (average error) of the quantization. Note that the bias converges towards zero as more bits are quantized away (except for QuantizationMode.TRN, where it converges towards a half).

Illustration of the different quantization modes.
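The bias values can be checked numerically. The following is a minimal sketch in plain Python, not APyTypes code: it models three of the modes with exact fractions, quantizes every value with a given number of fractional bits in the range [-4, 4) to an integer, and averages the error. The function names, the value range, and the sign convention (error = quantized value minus original value) are assumptions made for illustration.

```python
from fractions import Fraction
import math

FRAC_BITS = 3  # number of fractional bits removed by the quantization


# Plain-Python models of three modes (illustrative, not the APyTypes implementation).
def trn(x):
    """Round towards negative infinity (truncation)."""
    return math.floor(x)


def rnd(x):
    """Round to nearest, ties toward positive infinity."""
    return math.floor(x + Fraction(1, 2))


def rnd_conv(x):
    """Round to nearest, ties to even (round() on a Fraction rounds half to even)."""
    return round(x)


def bias(quantize, frac_bits=FRAC_BITS):
    """Average of (quantized - original) over all grid points in [-4, 4)."""
    scale = 1 << frac_bits
    values = [Fraction(k, scale) for k in range(-4 * scale, 4 * scale)]
    return sum(quantize(v) - v for v in values) / len(values)


biases = {f.__name__: bias(f) for f in (trn, rnd, rnd_conv)}
print(biases)
```

With three fractional bits, this model gives a bias of -7/16 for trn, 1/16 for rnd, and 0 for rnd_conv. Calling bias(trn, 8) gives -255/512, which is closer to one half in magnitude, while bias(rnd, 8) shrinks to 1/512.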
class apytypes.QuantizationMode(value)
RND = 5

Round to nearest, ties toward positive infinity.

Illustration of round to nearest, ties toward positive infinity
RND_CONV = 9

Round to nearest, ties to even.

Illustration of round to nearest, ties to even
RND_CONV_ODD = 10

Round to nearest, ties to odd.

Illustration of round to nearest, ties to odd
RND_INF = 7

Round to nearest, ties away from zero.

Illustration of round to nearest, ties away from zero
RND_MIN_INF = 8

Round to nearest, ties toward negative infinity.

Illustration of round to nearest, ties toward negative infinity
RND_ZERO = 6

Round to nearest, ties toward zero.

Illustration of round to nearest, ties toward zero
TRN = 0

Round towards negative infinity (truncation).

Implementation: remove additional bits.

Illustration of rounding towards negative infinity (truncation)
TRN_INF = 1

Round towards positive infinity.

Illustration of rounding towards positive infinity
TRN_ZERO = 2

Round towards zero (unbiased magnitude truncation).

Illustration of round towards zero (unbiased magnitude truncation)
TRN_AWAY = 3

Round away from zero.

Illustration of round away from zero
TRN_MAG = 4

Fixed-point magnitude truncation (add sign-bit).

Illustration of magnitude truncation
JAM = 11

Jamming/von Neumann rounding.

Illustration of jamming/von Neumann rounding
JAM_UNBIASED = 12

Unbiased jamming/von Neumann rounding.

Illustration of unbiased jamming/von Neumann rounding
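To make the differences between the directed and round-to-nearest modes concrete, the sketch below models them in plain Python on exact fractions, quantizing to an integer. It is an illustration built from the one-line descriptions above, not the APyTypes implementation; the names MODELS, rnd_conv_odd, and quantize_all are hypothetical. The bit-level modes TRN_MAG, JAM, and JAM_UNBIASED are left out because their exact behaviour is not spelled out in this section.

```python
from fractions import Fraction
import math

HALF = Fraction(1, 2)


def rnd_conv_odd(x):
    """Round to nearest, ties to odd."""
    low = math.floor(x)
    if x - low == HALF:  # exact tie: pick the odd neighbour
        return low if low % 2 else low + 1
    return math.floor(x + HALF)


# Mode name -> plain-Python model of "quantize to an integer".
MODELS = {
    "TRN": math.floor,        # towards negative infinity
    "TRN_INF": math.ceil,     # towards positive infinity
    "TRN_ZERO": math.trunc,   # towards zero
    "TRN_AWAY": lambda x: math.ceil(x) if x > 0 else math.floor(x),
    "RND": lambda x: math.floor(x + HALF),
    "RND_ZERO": lambda x: math.ceil(x - HALF) if x > 0 else math.floor(x + HALF),
    "RND_INF": lambda x: math.floor(x + HALF) if x > 0 else math.ceil(x - HALF),
    "RND_MIN_INF": lambda x: math.ceil(x - HALF),
    "RND_CONV": round,        # Fraction rounds half to even
    "RND_CONV_ODD": rnd_conv_odd,
}


def quantize_all(x):
    """Quantize x to an integer under every modelled mode."""
    x = Fraction(x)
    return {name: model(x) for name, model in MODELS.items()}


# Two ties (2.5 and -2.5) and one non-tie (2.375).
for value in ("5/2", "-5/2", "19/8"):
    print(value, quantize_all(value))
```

For the non-tie 19/8, all six round-to-nearest models agree on 2; they only differ at the ties, which is where the yellow dots in the illustrations sit.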

Aliases

TO_NEG = 0

Fixed-point truncation. Round towards negative infinity. Alias for TRN.

TO_POS = 1

Round towards positive infinity. Alias for TRN_INF.

TO_ZERO = 2

Unbiased magnitude truncation. Round towards zero. Alias for TRN_ZERO.

TO_AWAY = 3

Round away from zero. Alias for TRN_AWAY.

TIES_POS = 5

Fixed-point rounding. Round to nearest, ties toward positive infinity. Alias for RND.

TIES_EVEN = 9

Unbiased fixed-point rounding. Round to nearest, ties to even. Alias for RND_CONV.

TIES_ODD = 10

Alternate unbiased fixed-point rounding. Round to nearest, ties to odd. Alias for RND_CONV_ODD.

TIES_AWAY = 7

Round to nearest, ties away from zero. Alias for RND_INF.

TIES_NEG = 8

Round to nearest, ties toward negative infinity. Alias for RND_MIN_INF.

TIES_ZERO = 6

Round to nearest, ties toward zero. Alias for RND_ZERO.
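The alias relationships can be summarized with a small stand-in enum. This is a sketch using Python's standard enum module with the numeric values listed above; ModeSketch is a hypothetical name and is not apytypes.QuantizationMode, whose internal representation of aliases is not described here. In the standard enum module, a member that repeats an earlier value becomes an alias of that earlier member.

```python
from enum import IntEnum


class ModeSketch(IntEnum):
    """Stand-in enum with the values listed above (not apytypes.QuantizationMode)."""

    TRN = 0
    TRN_INF = 1
    TRN_ZERO = 2
    TRN_AWAY = 3
    TRN_MAG = 4
    RND = 5
    RND_ZERO = 6
    RND_INF = 7
    RND_MIN_INF = 8
    RND_CONV = 9
    RND_CONV_ODD = 10
    JAM = 11
    JAM_UNBIASED = 12
    # A repeated value makes the name an alias of the earlier member.
    TO_NEG = 0
    TO_POS = 1
    TO_ZERO = 2
    TO_AWAY = 3
    TIES_POS = 5
    TIES_ZERO = 6
    TIES_AWAY = 7
    TIES_NEG = 8
    TIES_EVEN = 9
    TIES_ODD = 10


# Map every alias name to the canonical member name it refers to.
aliases = {
    name: member.name
    for name, member in ModeSketch.__members__.items()
    if name != member.name
}
print(aliases)
```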

Utility functions

apytypes.get_float_quantization_mode() → QuantizationMode

Get current quantization context.

Returns:
QuantizationMode

apytypes.set_float_quantization_mode(mode: QuantizationMode) → None

Set current quantization context.

Parameters:
mode : QuantizationMode

The quantization mode to use.

apytypes.get_float_quantization_seed() → int

Get current quantization seed.

The quantization seed is used for stochastic quantization.

Returns:
int

apytypes.set_float_quantization_seed(seed: int) → None

Set current quantization seed.

The quantization seed is used for stochastic quantization.

Parameters:
seed : int

The quantization seed to use.
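Because these functions change global state, a common pattern is to save the current value, set a new one, and restore the old one afterwards. The sketch below shows that pattern as a generic context manager. The helper temporary_setting and the stand-in getter and setter are hypothetical and exist only so the example runs without APyTypes installed; the stand-in's initial value is arbitrary and says nothing about the library's default mode.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(getter, setter, value):
    """Set a global value for the duration of a with-block, then restore it."""
    previous = getter()
    setter(value)
    try:
        yield
    finally:
        setter(previous)  # restored even if the block raises


# Stand-in global state (hypothetical), used instead of the apytypes functions.
_state = {"mode": "RND_CONV"}


def get_mode():
    return _state["mode"]


def set_mode(mode):
    _state["mode"] = mode


with temporary_setting(get_mode, set_mode, "TO_NEG"):
    inside = get_mode()
after = get_mode()
print(inside, after)
```

With APyTypes, the same helper would presumably be called with apytypes.get_float_quantization_mode, apytypes.set_float_quantization_mode, and a QuantizationMode member, or with the corresponding seed functions and an integer.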

Sign of zero for floating-point

For multiplication and division, the sign of the result is always the XOR of the operands’ signs. For addition and subtraction, the sign of a zero result depends on the quantization mode, as the table below shows. The sign for subtraction follows from the same table, since \(x - y = x + (-y)\).

Sign of zero in floating-point addition

\(x + y\)                TO_NEG    Other modes
\((+0) + (+0)\)          \(+0\)    \(+0\)
\((+0) + (-0)\)          \(-0\)    \(+0\)
\((-0) + (+0)\)          \(-0\)    \(+0\)
\((-0) + (-0)\)          \(-0\)    \(-0\)
\(x + y, x = -y\)        \(-0\)    \(+0\)
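The table can be written as a small decision function. The sketch below is a plain-Python encoding of the table, not APyTypes code; the function names are hypothetical, and the operands must be Python floats so that the sign of zero is preserved. Subtraction is derived by negating the second operand. As a sanity check, Python's own float arithmetic (IEEE 754 binary64 with round to nearest, ties to even, on common platforms) should agree with the "Other modes" column.

```python
import math


def zero_sum_is_negative(x, y, to_neg):
    """Return True if the zero result of x + y is -0 according to the table.

    Assumes x + y is exactly zero. `to_neg` selects the TO_NEG column.
    """
    x_neg = math.copysign(1.0, x) < 0
    y_neg = math.copysign(1.0, y) < 0
    if x == 0 and y == 0 and x_neg == y_neg:
        return x_neg  # (+0) + (+0) -> +0 and (-0) + (-0) -> -0 in every mode
    return to_neg     # mixed-sign zeros or exact cancellation (x = -y)


def zero_difference_is_negative(x, y, to_neg):
    """Sign of a zero result of x - y, derived as x + (-y)."""
    return zero_sum_is_negative(x, -y, to_neg)


cases = [(0.0, 0.0), (0.0, -0.0), (-0.0, 0.0), (-0.0, -0.0), (1.5, -1.5)]
for x, y in cases:
    row = [zero_sum_is_negative(x, y, to_neg) for to_neg in (True, False)]
    print(x, y, ["-0" if negative else "+0" for negative in row])
```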

Overflow modes

class apytypes.OverflowMode(value)
WRAP = 0

Two’s complement wrapping.

Illustration of two's complement wrapping
SAT = 1

Two’s complement saturation.

Illustration of two's complement saturation
NUMERIC_STD = 2

Keep sign bit and remove intermediate bits.

Behaves like resize for signed in the VHDL package ieee.numeric_std.

Illustration of numeric_std overflowing
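The three overflow modes can be compared on plain integers. The sketch below is a minimal model, assuming a signed two's complement target of `bits` bits in total; it works on the stored integer value and ignores the binary point. The function names are hypothetical, and numeric_std follows one reading of "keep sign bit and remove intermediate bits": the sign of the input is kept and combined with the bits - 1 least significant bits.

```python
def wrap(value, bits):
    """Two's complement wrapping: keep the low `bits` bits, reinterpret as signed."""
    half = 1 << (bits - 1)
    return (value + half) % (2 * half) - half


def sat(value, bits):
    """Two's complement saturation: clamp to the representable range."""
    half = 1 << (bits - 1)
    return max(-half, min(half - 1, value))


def numeric_std(value, bits):
    """Keep the sign bit and the bits - 1 least significant bits."""
    half = 1 << (bits - 1)
    low = value & (half - 1)  # low bits of the two's complement pattern
    return low - half if value < 0 else low


# Narrow to 4 bits (range -8..7): two overflowing values and two that fit.
for value in (9, -9, 5, -3):
    print(value, wrap(value, 4), sat(value, 4), numeric_std(value, 4))
```

For 9 the three models return -7, 7, and 1, and for -9 they return 7, -8, and -1. Values that already fit, such as 5 and -3, pass through all three unchanged.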