Kueue for Batch and AI Jobs by Trex Team

Description

"Kueue for Batch and AI Jobs: Queueing, Fairness, and Resource Sharing in Kubernetes"
Batch pipelines and AI training don’t fail in Kubernetes because “the scheduler is broken”—they fail because multi-tenant clusters need explicit queueing, clear admission policy, and predictable sharing rules. This book is written for experienced Kubernetes platform engineers, ML infrastructure builders, and SREs who must turn scarce CPU/GPU fleets into reliable, fair, high-throughput systems without replacing the Kubernetes scheduler or turning operations into constant firefighting.
You’ll build a precise mental model of Kueue’s control loop and object graph (Workload, LocalQueue, ClusterQueue, and ResourceFlavor), then learn to design queue hierarchies and capacity policy that match real organizations. The book goes deep on admission mechanics—quota reservation, scheduling gates, and extensible AdmissionChecks—so you can reason about “admitted but not running” states, integrate provisioning/autoscaling, and even dispatch across clusters with MultiKueue. It culminates in fairness engineering: cohorts, borrowing, and preemption, with practical criteria for balancing utilization, latency, and blast radius—especially in heterogeneous GPU pools.
Expect an operationally grounded approach: troubleshooting runbooks, observability and SLO design for queueing systems, and upgrade-safe guidance on CRD versioning and migrations (including v1beta1 → v1beta2). Familiarity with Kubernetes controllers, RBAC, and resource requests/limits is assumed.
