ML & AI System Design for Staff Engineers

14 episodes covering ML infrastructure, AI-era systems, and Staff-level architectural thinking — feature stores, model serving, training platforms, vector search, LLM serving, RAG, and AI gateway design.

14 modules · 0 available · ~3.5 hours total

About This Course

This is Part 2 of our System Design series, focused on ML and AI infrastructure.

Staff-level interviews increasingly test ML infrastructure knowledge — not ML algorithms, but the systems that train, serve, and experiment with models at scale. This series teaches you how to design those systems.

We start with how ML system design interviews differ from traditional system design. Then we build up the core ML infrastructure stack: feature stores for training-serving consistency, stream processing for feature freshness, recommendation systems (two-tower retrieval, ANN search), vector search, model serving platforms, distributed training platforms, and experimentation infrastructure (A/B testing at scale).
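To make the two-tower retrieval idea concrete, here is a deliberately tiny sketch: user and item embeddings are compared by dot product, and the top-k items become candidates for ranking. The item names and vectors are invented for illustration; at real scale the brute-force scan below is replaced by an ANN index such as FAISS or HNSW.

```python
def dot(a, b):
    """Dot product of two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_top_k(user_vec, item_vecs, k=2):
    """Score every item against the user embedding; return top-k item ids."""
    scored = [(item_id, dot(user_vec, vec)) for item_id, vec in item_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored[:k]]

# Hypothetical 2-dimensional item embeddings produced by the item tower.
items = {
    "video_a": [0.9, 0.1],
    "video_b": [0.1, 0.9],
    "video_c": [0.7, 0.7],
}

print(retrieve_top_k([1.0, 0.0], items, k=2))  # ['video_a', 'video_c']
```

The point of the two-tower split is that item embeddings can be precomputed and indexed offline, so only the user tower runs at request time.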

The AI/LLM section covers three cutting-edge topics: LLM serving (KV cache, batching, GPU scheduling), RAG at scale, and AI gateway / multi-model routing. We close with organizational scalability and a capstone end-to-end ML platform design exercise.
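As a taste of the AI gateway module, here is a minimal sketch of cost-aware routing: among providers whose p99 latency fits the request's SLA, pick the cheapest. The provider names, prices, and latencies below are made up for illustration.

```python
# Hypothetical provider catalog an AI gateway might maintain.
PROVIDERS = [
    {"name": "provider_fast",  "cost_per_1k_tokens": 0.030, "p99_latency_ms": 400},
    {"name": "provider_cheap", "cost_per_1k_tokens": 0.002, "p99_latency_ms": 1500},
    {"name": "provider_mid",   "cost_per_1k_tokens": 0.010, "p99_latency_ms": 800},
]

def route(max_latency_ms):
    """Return the cheapest provider whose p99 latency meets the SLA."""
    eligible = [p for p in PROVIDERS if p["p99_latency_ms"] <= max_latency_ms]
    if not eligible:
        raise ValueError("no provider meets the latency SLA")
    return min(eligible, key=lambda p: p["cost_per_1k_tokens"])["name"]

print(route(1000))  # provider_mid: cheapest option under 1000 ms
print(route(2000))  # provider_cheap: all providers qualify, cheapest wins
```

A production gateway layers more onto this skeleton (health checks, token accounting, semantic caching, fallback on provider errors), but the core trade-off is the same: cost against latency under an SLA.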

Original curriculum inspired by publicly available engineering blog posts, industry papers on ML infrastructure, and Staff+ engineering experience.

Prerequisites

  • Completion of Part 1 (System Design Interview) or equivalent knowledge
  • Working knowledge of distributed systems (caching, sharding, message queues)
  • Basic understanding of ML concepts (training, inference, features, models)
  • Familiarity with at least one backend language (Python, Java, Go, C++)
  • No ML research experience required — this is about infrastructure, not algorithms

What You Will Learn

  • Understand how ML system design interviews differ from traditional system design
  • Design feature stores with training-serving consistency guarantees
  • Design recommendation infrastructure: candidate generation, ranking, real-time personalization
  • Design model serving platforms with canary rollout, A/B traffic splitting, and latency SLAs
  • Design distributed training infrastructure: parameter servers, AllReduce, GPU scheduling
  • Design vector search systems using ANN algorithms (FAISS, HNSW) at scale
  • Design LLM serving platforms: KV cache management, request batching, speculative decoding
  • Design RAG systems: document ingestion, chunking, embedding, retrieval optimization
  • Design AI gateways: multi-provider routing, cost-aware scheduling, semantic caching
  • Execute an end-to-end ML platform design from ingestion to serving
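As a taste of the training-serving consistency theme, here is a minimal sketch of the point-in-time lookup behind it: when building a training example for an event at time t, only feature values written at or before t may be used, otherwise the example leaks future information. The feature name and data are illustrative.

```python
import bisect

def feature_at(history, event_ts):
    """history: list of (write_ts, value) pairs sorted by write_ts.
    Return the latest value written at or before event_ts, or None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect.bisect_right(timestamps, event_ts)
    if idx == 0:
        return None  # the feature did not exist yet at event time
    return history[idx - 1][1]

# Hypothetical history of a "clicks in the last 7 days" feature.
clicks_7d = [(100, 3), (200, 5), (300, 9)]  # (write_ts, value)

print(feature_at(clicks_7d, 250))  # 5: the value that was live at ts=250
```

Offline training pipelines run this "point-in-time join" over logged events, so the model trains on exactly the feature values the online store would have served.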

Terminology Mapping

How classic concepts map to the terminology used in this course.

Classic                 This Course (Meta)
Feature Store           Feature Platform / Featurizer
Model Registry          Model Store
Model Serving           Predictor / Navi
Vector Search           FAISS
Stream Processing       Flink-based pipelines
Experiment Platform     A/B Testing System
AI Gateway              Multi-model router

Your Learning Path

Each module builds on the last. Take your time; the AI tutor is with you at every step.

1. ML System Design Interviews: How ML SD interviews differ from traditional SD — signals, structure, and evaluation
   15 min · Coming soon

2. Design ML Feature Store: Offline/online feature consistency, training-serving skew, and real-time feature serving
   15 min · Coming soon

3. Design Real-Time Stream Processing System: Stateful stream processing, windowed aggregation, and fault tolerance at scale
   15 min · Coming soon

4. Design Recommendation System Infrastructure: Two-tower retrieval, candidate generation, ranking, and real-time personalization
   15 min · Coming soon

5. Design Vector Search System (ANN at Scale): FAISS, HNSW, sharded vector indices, and hybrid search at scale
   15 min · Coming soon

6. Design Model Serving Platform: Multi-model hosting, canary rollout, GPU scheduling, and latency SLAs
   15 min · Coming soon

7. Design ML Training Platform at Scale: Distributed training, GPU scheduling, checkpointing, and experiment tracking
   15 min · Coming soon

8. Design A/B Testing System (Deep Dive): Randomization, exposure logging, interference effects, and metric attribution at scale
   15 min · Coming soon

9. Design Multi-Region Active-Active System: Conflict resolution, CRDTs, and global consistency at scale
   15 min · Coming soon

10. Design LLM Serving Platform: KV cache, request batching, speculative decoding, and GPU memory scheduling
    15 min · Coming soon

11. Design RAG System at Scale: Document ingestion, chunking, embedding, and retrieval-augmented generation
    15 min · Coming soon

12. Design AI Gateway / Multi-Model Routing: Multi-provider routing, cost-aware scheduling, token billing, and semantic caching
    15 min · Coming soon

13. Designing for Organizational Scalability: API boundaries, ownership models, and platform vs product infrastructure
    15 min · Coming soon

14. Capstone: End-to-End ML Platform Design. Design a complete ML platform from data ingestion to model serving — a 45-minute Staff-level exercise
    15 min · Coming soon