Quantization: Precision vs. Speed
How shrinking numbers from 32 bits to 4 bits keeps AI fast and cheap
As models grow to hundreds of billions of parameters, storing every weight as a 32-bit float becomes impractical. Quantization maps continuous values to a smaller discrete set using an affine transformation: Q = round(W/S + Z), where S is a scale factor and Z is a zero-point offset. This module covers the linear algebra of quantization: why it's an affine transformation, how it distorts the vector space, why 'outlier features' break naive approaches, and how techniques like GPTQ, AWQ, and SmoothQuant handle this. You'll implement basic quantization and measure the accuracy/speed tradeoff.

Mini-lab: Quantize a weight matrix to INT8 and INT4. Measure the reconstruction error (Frobenius norm of W - W_quantized) and see where outlier features cause problems.
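As a preview of the mini-lab, here is a minimal sketch of symmetric per-tensor quantization in NumPy (a simplified variant of the affine map above, with the zero-point Z fixed at 0). The weight values, matrix size, and the injected outlier are illustrative choices, not part of any specific model:

```python
import numpy as np

def quantize(W, bits=8):
    # Symmetric quantization: the scale S maps max|W| onto the signed int range.
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    scale = np.abs(W).max() / qmax
    Q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return Q, scale

def dequantize(Q, scale):
    return Q * scale

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(256, 256))  # typical small-magnitude weights
W[0, 0] = 1.0                               # one outlier feature inflates the scale

for bits in (8, 4):
    Q, s = quantize(W, bits)
    err = np.linalg.norm(W - dequantize(Q, s))  # Frobenius norm of W - W_quantized
    print(f"INT{bits}: scale={s:.5f}, reconstruction error={err:.4f}")
```

Note how the single outlier forces a large scale S, so the many small weights are rounded onto only a few quantization levels; this is exactly the failure mode that GPTQ, AWQ, and SmoothQuant are designed to mitigate.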
Estimated time: 60 minutes
Stuck on something? The AI tutor sees this lecture, so just ask.