Mechanistic Interpretability
Using linear algebra to reverse-engineer what neural networks learn
Mechanistic interpretability is the 'forensics' of AI: researchers use linear algebra to decode what individual neurons and layers actually represent. The key insight is the Linear Representation Hypothesis: high-level concepts such as 'truthfulness,' 'sentiment,' or 'programming language' are encoded as linear directions (vectors) in activation space. You can find these directions, measure how strongly an input activates them, and add or subtract them to steer model behavior. This module covers probing classifiers, activation steering, and concept vectors. Mini-lab: extract activations from a small language model, find the 'sentiment direction' by running PCA on activations from positive vs. negative examples, and show that adding this direction flips the model's output sentiment.
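The core linear-algebra step of the mini-lab can be sketched without a real language model. The snippet below is a toy illustration, not the lab itself: it uses synthetic NumPy vectors as a stand-in for transformer activations (extracting real activations would require a library such as `transformers`), plants a hypothetical ground-truth 'sentiment direction', recovers it as the top principal component of the positive-vs-negative activation set, and then 'steers' a negative activation by adding the recovered direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations: positive and negative examples are
# noisy clusters separated along a planted ground-truth direction.
d = 16                                       # activation dimensionality (assumed)
true_dir = np.zeros(d)
true_dir[0] = 1.0                            # hypothetical 'sentiment direction'
pos = rng.normal(0.0, 0.1, (50, d)) + true_dir
neg = rng.normal(0.0, 0.1, (50, d)) - true_dir

# PCA on the centered, combined data: the top principal component should
# recover the axis that separates positive from negative examples.
X = np.vstack([pos, neg])
X -= X.mean(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
sentiment_dir = vt[0]                        # first principal component

# PCA's sign is arbitrary; orient the direction toward the positive cluster.
if sentiment_dir @ (pos.mean(0) - neg.mean(0)) < 0:
    sentiment_dir = -sentiment_dir

cosine = abs(sentiment_dir @ true_dir) / np.linalg.norm(true_dir)
print("cosine similarity with planted direction:", round(cosine, 3))

# 'Steering': push a negative activation along the sentiment direction and
# check which cluster mean it is now closer to.
steered = neg[0] + 2.0 * sentiment_dir
closer_to_pos = (np.linalg.norm(steered - pos.mean(0))
                 < np.linalg.norm(steered - neg.mean(0)))
print("steered activation now closer to positive cluster:", closer_to_pos)
```

In the actual lab, `pos` and `neg` would be hidden-state vectors captured from a model's forward passes on positive and negative prompts, and steering means adding the scaled direction back into those hidden states during generation; the geometry, however, is exactly what this sketch shows.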
Estimated time: 60 minutes