DoK Talks #141 - Dossier: multi-tenant distributed Jupyter Notebooks

Data on Kubernetes
Thu, Jul 14, 9:00 AM (PDT)

Iacopo Colonnelli - Post-doctoral Researcher, University of Torino
Dario Tranchitella - Technical Advisor, CLASTIX

About this event

https://go.dok.community/slack

https://dok.community/

ABSTRACT OF THE TALK

When providing data analysis as a service, one must tackle several problems. Data privacy and protection by design are crucial when working on sensitive data. Performance and scalability are fundamental for compute-intensive workloads, e.g. training Deep Neural Networks. User-friendly interfaces and fast prototyping tools are essential to allow domain experts to experiment with new techniques. Portability and reproducibility are necessary to assess the actual value of results.

Kubernetes is the best platform to provide reliable, elastic, and maintainable services. However, Kubernetes alone is not enough to achieve large-scale multi-tenant reproducible data analysis. Out-of-the-box support for multi-tenancy is too coarse, with only two levels of segregation (i.e. the single namespace or the entire cluster). Offloading computation to off-cluster resources is non-trivial and requires manual configuration by the user. Also, Jupyter Notebooks per se provide little scalability (they execute locally and sequentially) or reproducibility (users can run cells in any order and any number of times).

The Dossier platform allows system administrators to manage multi-tenant distributed Jupyter Notebooks at the cluster level in the Kubernetes way, i.e. through Custom Resource Definitions (CRDs). Namespaces are aggregated into Tenants, and all security and accountability aspects are managed at that level. Each Notebook spawns in a user-dedicated namespace, subject to all Tenant-level constraints. Users can rely on provisioned resources, either in-cluster worker nodes or external resources such as HPC facilities. They can also plug in their own computing nodes in a BYOD fashion. Notebooks are interpreted as distributed workflows, where each cell is a task that can be offloaded to a different location in charge of its execution.
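As a sketch of the "Tenants aggregate namespaces" idea, a minimal Capsule Tenant might look like the manifest below. Field names follow the Capsule v1beta2 API as documented at the time of writing; the tenant and owner names are placeholders, and the current schema should be checked against the Capsule documentation.

```yaml
apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  name: data-science        # placeholder tenant name
spec:
  owners:
    - name: alice           # placeholder user; owners can create namespaces within the tenant
      kind: User
```

Every namespace created by an owner of this Tenant is then subject to the policies (quotas, network rules, etc.) defined at the Tenant level, which is the segregation layer Dossier builds on.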


KEY TAKE-AWAYS FROM THE TALK

From this talk, people will learn:

- The different requirements of Data analysis as a service

- How to configure multi-tenancy at the cluster level with Capsule

- How to write distributed workflows as Notebooks with Jupyter Workflows

- How to combine all these aspects into a single platform: Dossier

All the software presented in the talk is open source, so attendees can play with it directly and include it in their experiments with no additional restrictions.
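As a conceptual illustration of the "each cell is a task" model (this is not the actual Jupyter Workflow API, whose dependencies come from cell metadata), a notebook can be viewed as a small DAG executor: cells declare their dependencies and run in topological order, and each step is a candidate for remote offloading.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical notebook cells, modeled as named tasks over a shared context.
# Dependencies are declared explicitly here for illustration only.
cells = {
    "load": (lambda ctx: ctx.update(data=[1, 2, 3]), []),
    "train": (lambda ctx: ctx.update(model=sum(ctx["data"])), ["load"]),
    "report": (lambda ctx: ctx.update(report=f"model={ctx['model']}"), ["train"]),
}

def run_notebook(cells):
    """Execute cells in dependency order, sharing a context dict."""
    graph = {name: deps for name, (_, deps) in cells.items()}
    ctx = {}
    for name in TopologicalSorter(graph).static_order():
        cells[name][0](ctx)  # in Dossier, this step could run on a remote location
    return ctx

print(run_notebook(cells)["report"])  # model=6
```

Because the execution order is derived from the dependency graph rather than from the order in which a user happens to run cells, this model recovers the reproducibility that plain Notebooks lack.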


OTHER LINKS

https://streamflow.di.unito.it/

https://jupyter-workflow.di.unito.it/

https://capsule.clastix.io/

https://github.com/clastix/capsule


Host

  • Bart Farrell

    Data on Kubernetes Community

    CNCF Ambassador
