Dec 4, 8:00 AM – Dec 5, 7:59 AM (UTC)
The demand for scalable, cost-efficient inference of large language models (LLMs) is outpacing the capabilities of traditional serving stacks. Unlike conventional ML workloads, LLM inference brings unique challenges: long prompts, token-by-token generation, bursty traffic patterns, and the need for consistently high GPU utilization. These factors make request routing, deterministic scheduling, and autoscaling significantly more complex than in typical model serving scenarios. In this presentation, we will discuss KServe, an open source project for LLM inference at scale.
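As a rough illustration (not taken from the talk itself), deploying an LLM behind KServe's `InferenceService` abstraction via the KServe Python SDK might look like the sketch below. The service name, namespace, Hugging Face model URI, and GPU resource request are all placeholder assumptions.

```python
# Minimal sketch of creating a KServe InferenceService for an LLM.
# All concrete values (name, namespace, model URI, GPU count) are assumed.
from kubernetes import client
from kserve import (
    KServeClient,
    constants,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1ModelFormat,
    V1beta1ModelSpec,
    V1beta1PredictorSpec,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="llm-demo", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            # KServe handles routing and autoscaling for the predictor;
            # min_replicas=0 enables scale-to-zero between bursts.
            min_replicas=0,
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="huggingface"),
                # Assumed model; any hf:// URI the runtime can pull works.
                storage_uri="hf://meta-llama/Llama-3.1-8B-Instruct",
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            ),
        )
    ),
)

# Submit to whichever cluster the current kubeconfig points at.
KServeClient().create(isvc)
```

From here the generated endpoint serves token-by-token completions, and the platform's autoscaler reacts to the bursty traffic patterns described above.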
Member of Technical Staff, Nutanix AI
Member of Technical Staff, Nutanix AI