
Small Models, Big Agents: Fine-Tuning and Serving SLMs on Kubernetes

by Christian Melendez, AWS

📍 Atlas 2 · AI / ML · Intermediate

12:30 – 13:00

NVIDIA Research recently argued that small language models (yup, not LLMs) are the future of agentic AI. Models under 10 billion parameters can handle the majority of agentic tasks while being cheaper, faster, and easier to deploy. But the real story isn't about picking one size. It's about knowing when to use which.

In this session, we'll go end-to-end: from fine-tuning both a small and a 70B language model with QLoRA on Kubernetes, to serving inference with llama.cpp, to autoscaling the whole thing with KEDA and Karpenter. You'll see how to build a domain-specific agentic tool, expose it via MCP (Model Context Protocol), and right-size your models: SLMs for the repetitive, scoped tasks that make up most agentic workflows, and larger models when deeper reasoning is needed.
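As a sketch of what the fine-tuning step can look like on a cluster, here is a minimal Kubernetes Job that runs a QLoRA training script on a single GPU node. The image name, script, flags, and model are illustrative placeholders, not artifacts from the talk:

```yaml
# Hypothetical Job for a QLoRA fine-tuning run on Kubernetes.
# Image, command, args, and labels are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: qlora-finetune-slm
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/qlora-trainer:latest  # placeholder image
          command: ["python", "train.py"]
          args:
            - "--base-model=<your-slm>"   # e.g. a sub-10B-parameter model
            - "--load-in-4bit"            # QLoRA: 4-bit quantized base weights
            - "--lora-r=16"               # low-rank adapter dimension
          resources:
            limits:
              nvidia.com/gpu: 1           # single-GPU QLoRA run
      nodeSelector:
        karpenter.sh/capacity-type: spot  # let Karpenter provision a spot GPU node
```

Because QLoRA trains small low-rank adapters over a 4-bit quantized base model, even the 70B run fits on far less GPU memory than full fine-tuning would need, which is what makes a single-GPU Job like this plausible for the SLM case.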

We'll cover how Karpenter handles node provisioning across different instance types and architectures, how KEDA drives workload-based autoscaling for inference, and why the combination lets you run AI workloads without over-provisioning. Whether you're a platform engineer curious about AI on Kubernetes or an ML practitioner tired of fighting for GPU quota, this talk offers a practical, production-tested path from fine-tuning to serving at scale.
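To make that combination concrete, here is one plausible pairing: a Karpenter NodePool that can draw from several GPU instance families and capacity types, plus a KEDA ScaledObject that scales an inference Deployment from a Prometheus metric instead of CPU. All names, metrics, and thresholds below are illustrative assumptions:

```yaml
# Hypothetical Karpenter NodePool: lets the cluster provision from several
# GPU instance families and both spot and on-demand capacity.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]           # illustrative GPU families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    nvidia.com/gpu: 8                    # cap total GPU provisioning
---
# Hypothetical KEDA ScaledObject: scales the llama.cpp Deployment on a
# request-rate metric, and to zero when idle.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: slm-inference
spec:
  scaleTargetRef:
    name: llamacpp-server                # placeholder Deployment name
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(inference_requests_total[1m]))  # placeholder metric
        threshold: "5"
```

The division of labor is the point: KEDA decides how many inference pods should exist based on actual load, and Karpenter provisions (and deprovisions) just enough nodes to fit them, which is how the pair avoids standing over-provisioned GPU capacity.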