CNCF On-Demand Webinar: Running a multi-tenant platform on a managed Kubernetes cluster

CNCF Online Programs

Oct 20, 2022, 7:00 AM – Oct 21, 2022, 7:00 AM

Virtual event

About this event

Capital One created an internal multi-tenant platform running machine learning pipelines at scale on Kubernetes. The platform runs thousands of pods daily, each performing unique, complex tasks for tenants in isolated namespaces. These tenants and namespaces are administered centrally by our platform, while the platform itself is a tenant on a cluster administered by another team. This architecture has led to unique design and operational decisions.

In building the platform, the team learned a great deal about operating a complex, multi-tenant platform on a centrally managed Kubernetes cluster. This talk covers incidents that exposed pain points in this operating model, the troubleshooting methods used, and the operational enhancements implemented in response.

We'll also discuss best practices the team learned, including establishing clear communication and responsibilities with our cluster admins, staying up to date on infrastructure changes and testing them to detect regressions, and continually reviewing logging and monitoring as the application gains complexity.

Speakers

  • David Harrington

    Capital One

    Manager

  • Patrick Hennis

    Capital One

    Software Engineer

  • Kristian Langholm

    Capital One

    Senior Software Engineer

  • Ankur Mohan

    Capital One

    Director of Engineering

  • Trever Hallock

    Capital One

    Senior Machine Learning Engineer

  • Cruise Hall

    Capital One

    Senior Software Engineer
