Quantization: Precision vs. Speed

How shrinking numbers from 32 bits to 4 bits keeps AI fast and cheap

As models grow to hundreds of billions of parameters, storing every weight as a 32-bit float becomes impractical. Quantization maps continuous values to a small discrete set via an affine transformation: Q = round(W/S + Z), where S is a scale factor and Z a zero-point. This module covers the linear algebra of quantization: why it is an affine transformation, how it distorts the vector space, why "outlier features" break naive approaches, and how techniques like GPTQ, AWQ, and SmoothQuant handle them. You'll implement basic quantization and measure the accuracy/speed tradeoff.

Mini-lab: Quantize a weight matrix to INT8 and INT4, measure the reconstruction error (the Frobenius norm of W - W_quantized), and see where outlier features cause problems.
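As a warm-up for the mini-lab, here is a minimal NumPy sketch of the round-trip quantize/dequantize step and the Frobenius-norm reconstruction error. For simplicity it uses symmetric quantization (zero-point Z = 0, so Q = round(W/S)); the synthetic outlier and the matrix shape are illustrative choices, not part of the lab spec.

```python
import numpy as np

def quantize_dequantize(W, bits):
    """Symmetric per-tensor quantization: Q = round(W / S), W_hat = Q * S."""
    qmax = 2 ** (bits - 1) - 1           # 127 for INT8, 7 for INT4
    S = np.abs(W).max() / qmax           # scale so the largest |weight| maps to qmax
    Q = np.clip(np.round(W / S), -qmax - 1, qmax)
    return Q * S                         # dequantize back to float for comparison

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
W[0, 0] = 50.0                           # one synthetic "outlier feature"

for bits in (8, 4):
    W_hat = quantize_dequantize(W, bits)
    err = np.linalg.norm(W - W_hat)      # Frobenius norm of W - W_quantized
    print(f"INT{bits} reconstruction error: {err:.2f}")
```

Because the scale S is set by the largest absolute value, the single outlier stretches the quantization grid and wastes most of the INT4 levels on empty range, so the error jumps sharply from INT8 to INT4. Try removing the outlier line and rerunning to see both errors shrink.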

Estimated time: 60 minutes

Stuck on something? The AI tutor sees this lecture—just ask.
