DoK Talks #141 - Dossier: multi-tenant distributed Jupyter Notebooks

Data on Kubernetes
Thu, Jul 14, 9:00 AM (PDT)

Iacopo Colonnelli - Post-doctoral Researcher, University of Torino
Dario Tranchitella - Technical Advisor, CLASTIX

About this event

https://go.dok.community/slack

https://dok.community/

ABSTRACT OF THE TALK

When providing data analysis as a service, one must tackle several problems. Data privacy and protection by design are crucial when working on sensitive data. Performance and scalability are fundamental for compute-intensive workloads, e.g. training Deep Neural Networks. User-friendly interfaces and fast prototyping tools are essential to allow domain experts to experiment with new techniques. Portability and reproducibility are necessary to assess the actual value of results.

Kubernetes is the best platform to provide reliable, elastic, and maintainable services. However, Kubernetes alone is not enough to achieve large-scale multi-tenant reproducible data analysis. Out-of-the-box support for multi-tenancy is too coarse, with only two levels of segregation (i.e. the single namespace or the entire cluster). Offloading computation to off-cluster resources is non-trivial and requires manual configuration by the user. Also, Jupyter Notebooks per se provide little scalability (they execute locally and sequentially) or reproducibility (users can run cells in any order and any number of times).

The Dossier platform allows system administrators to manage multi-tenant distributed Jupyter Notebooks at the cluster level in the Kubernetes way, i.e. through Custom Resource Definitions (CRDs). Namespaces are aggregated into Tenants, and all security and accountability aspects are managed at that level. Each Notebook spawns in a user-dedicated namespace, subject to all Tenant-level constraints. Users can rely on provisioned resources, either in-cluster worker nodes or external resources such as HPC facilities. They can also plug in their own computing nodes in a BYOD fashion. Notebooks are interpreted as distributed workflows, where each cell is a task that can be offloaded to a different location in charge of its execution.
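As a sketch of the "Tenants aggregate namespaces" idea, a minimal Capsule Tenant might look like the manifest below. Field names follow the Capsule v1beta2 API as documented at the time of writing; the tenant and owner names are placeholders, and the current schema should be checked against the Capsule documentation.

```yaml
apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  name: data-science        # placeholder tenant name
spec:
  owners:
    - name: alice           # placeholder user; owners can create namespaces within the tenant
      kind: User
```

Every namespace created by an owner of this Tenant is then subject to the policies (quotas, network rules, etc.) defined at the Tenant level, which is the segregation layer Dossier builds on.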


KEY TAKE-AWAYS FROM THE TALK

From this talk, people will learn:

- The different requirements of Data analysis as a service

- How to configure multi-tenancy at the cluster level with Capsule

- How to write distributed workflows as Notebooks with Jupyter Workflows

- How to combine all these aspects into a single platform: Dossier

All the software presented in the talk is open source, so attendees can play with it directly and include it in their experiments with no additional restrictions.
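As a conceptual illustration of the "each cell is a task" model (this is not the actual Jupyter Workflow API, whose dependencies come from cell metadata), a notebook can be viewed as a small DAG executor: cells declare their dependencies and run in topological order, and each step is a candidate for remote offloading.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical notebook cells, modeled as named tasks over a shared context.
# Dependencies are declared explicitly here for illustration only.
cells = {
    "load": (lambda ctx: ctx.update(data=[1, 2, 3]), []),
    "train": (lambda ctx: ctx.update(model=sum(ctx["data"])), ["load"]),
    "report": (lambda ctx: ctx.update(report=f"model={ctx['model']}"), ["train"]),
}

def run_notebook(cells):
    """Execute cells in dependency order, sharing a context dict."""
    graph = {name: deps for name, (_, deps) in cells.items()}
    ctx = {}
    for name in TopologicalSorter(graph).static_order():
        cells[name][0](ctx)  # in Dossier, this step could run on a remote location
    return ctx

print(run_notebook(cells)["report"])  # model=6
```

Because the execution order is derived from the dependency graph rather than from the order in which a user happens to run cells, this model recovers the reproducibility that plain Notebooks lack.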


OTHER LINKS

https://streamflow.di.unito.it/

https://jupyter-workflow.di.unito.it/

https://capsule.clastix.io/

https://github.com/clastix/capsule


Host

  • Bart Farrell

    Data on Kubernetes Community

    CNCF Ambassador
