TL;DR: This architecture uses Karpenter + KEDA + Dragonfly on EKS to scale GPU inference pods from zero, pull model images faster, and cut GPU spend with spot-first provisioning. Cold starts take 84s; warm starts take 7s (with a small image). Everything is GitOps-driven via ArgoCD and fully reproducible with Terraform.
If you’re tired of paying for GPU nodes that sit idle half the day, or waiting minutes for cold starts when traffic suddenly spikes, this guide is for you.
Most teams running GPU inference on Kubernetes eventually hit the same wall:
GPUs are expensive
Traffic is spiky
Cold starts are painful
Large model images make everything worse
Scaling is either too slow or too costly
GitOps workflows often break when autoscaling enters the picture
This architecture addresses each of these problems.
It gives you:
Scale‑to‑zero when idle
Fast burst capacity when demand arrives
Predictable cost with spot‑first provisioning
Minimal cold‑start pain, even with 8–40 GB model images
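To put a rough number on the "idle half the day" problem, here is a back-of-the-envelope sketch. The hourly rate is a made-up placeholder, not a real AWS price, and the model ignores cold-start overhead and spot discounts; it only compares an always-on node with one that is billed just for its busy hours.

```python
def monthly_gpu_cost(hourly_rate, busy_hours_per_day, always_on, days=30):
    """Return the monthly cost of one GPU node.

    An always-on node is billed 24 hours a day; a scale-to-zero node
    is billed only for the hours it actually serves traffic.
    """
    billed_hours = 24 if always_on else busy_hours_per_day
    return hourly_rate * billed_hours * days


# Hypothetical inputs: replace with your own pricing and traffic profile.
HOURLY_RATE = 1.00  # assumed USD per hour, NOT a real instance price
BUSY_HOURS = 12     # "idle half the day"

always_on_cost = monthly_gpu_cost(HOURLY_RATE, BUSY_HOURS, always_on=True)
scale_to_zero_cost = monthly_gpu_cost(HOURLY_RATE, BUSY_HOURS, always_on=False)
savings_fraction = 1 - scale_to_zero_cost / always_on_cost

print(f"always on:     ${always_on_cost:.2f}/month")
print(f"scale to zero: ${scale_to_zero_cost:.2f}/month")
print(f"savings:       {savings_fraction:.0%}")
```

Under these assumptions the saving simply equals the idle fraction of the day. Spot pricing would lower the hourly rate on top of that, while cold starts add some billed time back.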