ML & AI System Design for Staff Engineers

14 episodes covering ML infrastructure, AI-era systems, and Staff-level architectural thinking — feature stores, model serving, training platforms, vector search, LLM serving, RAG, and AI gateway design.

14 modules · 0 available · ~3.5 hours total

About This Course

This is Part 2 of our System Design series, focused on ML and AI infrastructure.

Staff-level interviews increasingly test ML infrastructure knowledge — not ML algorithms, but the systems that train, serve, and experiment with models at scale. This series teaches you how to design those systems.

We start with how ML system design interviews differ from traditional system design. Then we build up the core ML infrastructure stack: feature stores for training-serving consistency, stream processing for feature freshness, recommendation systems (two-tower retrieval, ANN search), vector search, model serving platforms, distributed training platforms, and experimentation infrastructure (A/B testing at scale).
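To make the two-tower retrieval idea concrete, here is a deliberately tiny sketch: user and item embeddings are compared by dot product, and the top-k items become candidates for ranking. The item names and vectors are invented for illustration; at real scale the brute-force scan below is replaced by an ANN index such as FAISS or HNSW.

```python
def dot(a, b):
    """Dot product of two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_top_k(user_vec, item_vecs, k=2):
    """Score every item against the user embedding; return top-k item ids."""
    scored = [(item_id, dot(user_vec, vec)) for item_id, vec in item_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored[:k]]

# Hypothetical 2-dimensional item embeddings produced by the item tower.
items = {
    "video_a": [0.9, 0.1],
    "video_b": [0.1, 0.9],
    "video_c": [0.7, 0.7],
}

print(retrieve_top_k([1.0, 0.0], items, k=2))  # ['video_a', 'video_c']
```

The point of the two-tower split is that item embeddings can be precomputed and indexed offline, so only the user tower runs at request time.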

The AI/LLM section covers three cutting-edge topics: LLM serving (KV cache, batching, GPU scheduling), RAG at scale, and AI gateway / multi-model routing. We close with organizational scalability and a capstone end-to-end ML platform design exercise.
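As a taste of the AI gateway module, here is a minimal sketch of cost-aware routing: among providers whose p99 latency fits the request's SLA, pick the cheapest. The provider names, prices, and latencies below are made up for illustration.

```python
# Hypothetical provider catalog an AI gateway might maintain.
PROVIDERS = [
    {"name": "provider_fast",  "cost_per_1k_tokens": 0.030, "p99_latency_ms": 400},
    {"name": "provider_cheap", "cost_per_1k_tokens": 0.002, "p99_latency_ms": 1500},
    {"name": "provider_mid",   "cost_per_1k_tokens": 0.010, "p99_latency_ms": 800},
]

def route(max_latency_ms):
    """Return the cheapest provider whose p99 latency meets the SLA."""
    eligible = [p for p in PROVIDERS if p["p99_latency_ms"] <= max_latency_ms]
    if not eligible:
        raise ValueError("no provider meets the latency SLA")
    return min(eligible, key=lambda p: p["cost_per_1k_tokens"])["name"]

print(route(1000))  # provider_mid: cheapest option under 1000 ms
print(route(2000))  # provider_cheap: all providers qualify, cheapest wins
```

A production gateway layers more onto this skeleton (health checks, token accounting, semantic caching, fallback on provider errors), but the core trade-off is the same: cost against latency under an SLA.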

Original curriculum inspired by publicly available engineering blog posts, industry papers on ML infrastructure, and Staff+ engineering experience.

Prerequisites

  • Completion of Part 1 (System Design Interview) or equivalent knowledge
  • Working knowledge of distributed systems (caching, sharding, message queues)
  • Basic understanding of ML concepts (training, inference, features, models)
  • Familiarity with at least one backend language (Python, Java, Go, C++)
  • No ML research experience required — this is about infrastructure, not algorithms

What You Will Learn

  • Understand how ML system design interviews differ from traditional system design
  • Design feature stores with training-serving consistency guarantees
  • Design recommendation infrastructure: candidate generation, ranking, real-time personalization
  • Design model serving platforms with canary rollout, A/B traffic splitting, and latency SLAs
  • Design distributed training infrastructure: parameter servers, AllReduce, GPU scheduling
  • Design vector search systems using ANN algorithms (FAISS, HNSW) at scale
  • Design LLM serving platforms: KV cache management, request batching, speculative decoding
  • Design RAG systems: document ingestion, chunking, embedding, retrieval optimization
  • Design AI gateways: multi-provider routing, cost-aware scheduling, semantic caching
  • Execute an end-to-end ML platform design from ingestion to serving
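As a taste of the training-serving consistency theme, here is a minimal sketch of the point-in-time lookup behind it: when building a training example for an event at time t, only feature values written at or before t may be used, otherwise the example leaks future information. The feature name and data are illustrative.

```python
import bisect

def feature_at(history, event_ts):
    """history: list of (write_ts, value) pairs sorted by write_ts.
    Return the latest value written at or before event_ts, or None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect.bisect_right(timestamps, event_ts)
    if idx == 0:
        return None  # the feature did not exist yet at event time
    return history[idx - 1][1]

# Hypothetical history of a "clicks in the last 7 days" feature.
clicks_7d = [(100, 3), (200, 5), (300, 9)]  # (write_ts, value)

print(feature_at(clicks_7d, 250))  # 5: the value that was live at ts=250
```

Offline training pipelines run this "point-in-time join" over logged events, so the model trains on exactly the feature values the online store would have served.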

Terminology Mapping

How classic concepts map to the terminology used in this course.

Classic                 This Course (Meta)
Feature Store           Feature Platform / Featurizer
Model Registry          Model Store
Model Serving           Predictor / Navi
Vector Search           FAISS
Stream Processing       Flink-based pipelines
Experiment Platform     A/B Testing System
AI Gateway              Multi-model router

Your Learning Path

Each module builds on the last. Take your time; the AI tutor is with you at every step.

1. ML System Design Interviews: How ML SD interviews differ from traditional SD — signals, structure, and evaluation
   15 min · Coming soon

2. Design ML Feature Store: Offline/online feature consistency, training-serving skew, and real-time feature serving
   15 min · Coming soon

3. Design Real-Time Stream Processing System: Stateful stream processing, windowed aggregation, and fault tolerance at scale
   15 min · Coming soon

4. Design Recommendation System Infrastructure: Two-tower retrieval, candidate generation, ranking, and real-time personalization
   15 min · Coming soon

5. Design Vector Search System (ANN at Scale): FAISS, HNSW, sharded vector indices, and hybrid search at scale
   15 min · Coming soon

6. Design Model Serving Platform: Multi-model hosting, canary rollout, GPU scheduling, and latency SLAs
   15 min · Coming soon

7. Design ML Training Platform at Scale: Distributed training, GPU scheduling, checkpointing, and experiment tracking
   15 min · Coming soon

8. Design A/B Testing System (Deep Dive): Randomization, exposure logging, interference effects, and metric attribution at scale
   15 min · Coming soon

9. Design Multi-Region Active-Active System: Conflict resolution, CRDTs, and global consistency at scale
   15 min · Coming soon

10. Design LLM Serving Platform: KV cache, request batching, speculative decoding, and GPU memory scheduling
    15 min · Coming soon

11. Design RAG System at Scale: Document ingestion, chunking, embedding, and retrieval-augmented generation
    15 min · Coming soon

12. Design AI Gateway / Multi-Model Routing: Multi-provider routing, cost-aware scheduling, token billing, and semantic caching
    15 min · Coming soon

13. Designing for Organizational Scalability: API boundaries, ownership models, and platform vs product infrastructure
    15 min · Coming soon

14. Capstone: End-to-End ML Platform Design. Design a complete ML platform from data ingestion to model serving — a 45-minute Staff-level exercise
    15 min · Coming soon