Capital One created an internal multi-tenant platform running machine learning pipelines at scale on Kubernetes. The platform runs thousands of pods daily, each performing unique, complex tasks for tenants in isolated namespaces. These tenants and namespaces are administered centrally by our platform, while the platform itself is a tenant on a cluster administered by another team. This architecture has led to unique design and operational decisions.
In building the platform, the team learned a great deal about effectively operating a complex, multi-tenant platform on a centrally managed Kubernetes cluster. This talk will cover incidents that exposed pain points of this operating model, the troubleshooting methods used, and the operational enhancements implemented in response.
We'll also discuss best practices the team learned, including establishing clear communication and responsibilities with our cluster admins, staying up to date on and testing infrastructure changes to detect regressions, and continually reviewing logging and monitoring as the application grows in complexity.