Capital One created an internal multi-tenant platform running machine learning pipelines at scale on Kubernetes. The platform runs thousands of pods daily, each performing unique, complex tasks for tenants in isolated namespaces. These tenants and namespaces are administered centrally by our platform, while the platform itself is a tenant on a cluster administered by another team. This architecture has led to unique design and operational decisions.
In building the platform, the team learned a great deal about effectively operating a complex, multi-tenant platform on a centrally managed Kubernetes cluster. This talk will cover incidents that exposed pain points of this operating model, the troubleshooting methods used, and the operational enhancements implemented in response.
We'll also discuss best practices the team learned, including establishing clear communication and responsibilities with our cluster admins, staying up to date on and testing infrastructure changes to detect regressions, and continually reviewing logging and monitoring as the application grows in complexity.