Kubernetes Guide
Overview
Kubernetes is being adopted in HPC clusters to orchestrate deployments (e.g. software, infrastructure) and to run certain workloads (e.g. AI/ML inference). There is ongoing interest in integrating Kubernetes and Slurm to achieve a unified cluster, optimized resource utilization, and workflows that leverage the strengths of each system.
How Slurm and Kubernetes each handle certain types of workloads may change over time, as may the ways they interact with each other, which could allow for new possibilities. This is still an evolving area.
SUNK — Slurm on Kubernetes
SUNK ("Slurm on Kubernetes") is an effort under development with CoreWeave that provides a converged Slurm and Kubernetes environment.
CoreWeave's recent blog post, "Introducing SUNK: A Slurm on Kubernetes Implementation for HPC and Large Scale AI", provides an overview of the architecture and use cases.
SUNK is expected to be open-sourced in early 2024.
Presentations
Note that older presentations may contain outdated information.
Presentations from 2023
- Slurm and/or/vs Kubernetes, Tim Wickberg, SchedMD (SC23, November 2023)
- The Best of Both Worlds: Slurm on Kubernetes, Navarre Pratt and Jacob Feldman, CoreWeave (SLUG23, November 2023)
- Never use Slurm HA again: Solve all your problems with Kubernetes, Chris Samuel and Doug Jacobsen, NERSC (SLUG23, November 2023)
Last modified 11 December 2023