The Attention Mechanism

How transformers use matrix operations to focus on what matters

Attention is the core innovation behind transformers, and it's pure linear algebra. This module derives scaled dot-product attention step by step: project the inputs into Query, Key, and Value matrices, compute attention scores via QK^T / sqrt(d_k), apply softmax, then multiply by V. You'll see that multi-head attention is just several smaller attention operations run in parallel, which amounts to a block-diagonal matrix structure. No black boxes. Mini-lab: Implement single-head and multi-head attention from scratch in NumPy. Feed in a sentence and visualize the attention weight matrix as a heatmap.
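As a preview of the mini-lab, here is a minimal NumPy sketch of the steps above. The shapes, random projection matrices, and function names are illustrative, not the lab's reference solution: a 4-token sequence with model dimension 8, and 2 heads for the multi-head version.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

def multi_head(X, Wq, Wk, Wv, Wo, n_heads):
    # Multi-head attention: split the projections into n_heads
    # lower-dimensional heads, attend in parallel, concatenate, mix with Wo.
    d = X.shape[-1]
    d_h = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # (seq, d) -> (n_heads, seq, d_h)
    split = lambda M: M.reshape(-1, n_heads, d_h).transpose(1, 0, 2)
    out, _ = attention(split(Q), split(K), split(V))
    # (n_heads, seq, d_h) -> (seq, d)
    out = out.transpose(1, 0, 2).reshape(-1, d)
    return out @ Wo

# Toy example: 4 tokens, model dimension 8 (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv, Wo = (rng.standard_normal((8, 8)) for _ in range(4))

single_out, w = attention(X @ Wq, X @ Wk, X @ Wv)
multi_out = multi_head(X, Wq, Wk, Wv, Wo, n_heads=2)
print(single_out.shape, multi_out.shape)  # (4, 8) (4, 8)
```

The attention weight matrix `w` is exactly what the lab asks you to visualize as a heatmap: entry (i, j) is how much token i attends to token j.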

Estimated time: 60 minutes

