This has come up a few times, most recently on the Pangeo ML working group and cloud devops calls.
## Motivation
Pangeo users would like a way to submit long-running tasks that aren’t necessarily tied to
the lifetime of a JupyterLab session. Examples include
- Training a machine learning model
- Performing a large ETL job
- …
## Demo
The introduction of Dask-Gateway gets us partway there. It (optionally)
decouples the lifetime of the Dask cluster from the lifetime of the notebook
kernel, so we can piece together a batch-style workflow with some effort.
My prototype is at https://github.com/TomAugspurger/pangeo-batch-demo. The
summary:
- Connect to the Gateway from outside the hub.
```python
import os

from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

auth = JupyterHubAuth(os.environ["PANGEO_TOKEN"])

# Proxy address will be made easier to find.
gateway = Gateway(
    address="https://staging.us-central1-b.gcp.pangeo.io/services/dask-gateway/",
    proxy_address="tls://104.197.142.28:8786",
    auth=auth,
)
```
This requires:
- Having a JupyterHub token (which you can create in the JupyterHub UI)
- Making the Gateway accessible from the public internet (explored in https://github.com/pangeo-data/pangeo-cloud-federation/issues/694). As noted, we’d make a static URL for the proxy address (or remove the need to specify it).
- Start a Cluster that stays alive
```python
cluster = gateway.new_cluster(shutdown_on_close=False)
```
This is the batch-style workflow, where the client (a Python process running on my laptop) can go away but the computation continues.
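Before disconnecting, you’d typically also give the cluster some workers. A minimal sketch using the cluster’s scaling methods (the worker counts here are arbitrary):

```python
# Fix the pool at four workers, or uncomment to autoscale within bounds.
cluster.scale(4)
# cluster.adapt(minimum=1, maximum=20)
```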
- Submit your job
```python
from dask.distributed import Client, fire_and_forget

client = Client(cluster)  # a client connected to the long-lived cluster
fut = client.submit(main)
fire_and_forget(fut)
```
We’re building on Dask here, so our job submission is a Python function that we submit with the Dask client. We also call `fire_and_forget`
on the result to ensure that the computation completes, even if all the clients go away.
In my example, the job does a grid search using joblib’s Dask backend. The
training happens on the cluster.
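For reference, `main` could look roughly like the following. This is a sketch rather than the exact code from the demo repo; the estimator, parameter grid, and synthetic dataset are placeholders:

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def main():
    # Placeholder data and search space; the real demo's differ.
    X, y = make_classification(n_samples=1_000, random_state=0)
    search = GridSearchCV(
        SVC(),
        param_grid={"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]},
        n_jobs=-1,
    )
    # joblib's Dask backend fans the CV fits out to the cluster's workers.
    with parallel_backend("dask"):
        search.fit(X, y)
    return search.best_params_
```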
## Issues
There are plenty of issues with this approach.
- Users don’t have great visibility into the status of jobs and what’s been submitted. The Dask Dashboard is available, but that’s about it.
- Clusters aren’t shut down. The `shutdown_on_close` argument lets us disconnect the client, but there’s no easy way to say “shut this down when my `main` job completes”. You’d need to manually reconnect to the gateway and shut things down (see the sketch after this list), or wait for Pangeo to shut it down for you after 60 minutes of inactivity.
- The `main` job is running on a Dask worker, which can cause strange things (e.g. trying to close the cluster from within a worker is… strange).
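To illustrate the second point: the manual cleanup looks roughly like this, assuming the same connection settings as above (a sketch, not tooling the demo provides):

```python
import os

from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

gateway = Gateway(
    address="https://staging.us-central1-b.gcp.pangeo.io/services/dask-gateway/",
    proxy_address="tls://104.197.142.28:8786",
    auth=JupyterHubAuth(os.environ["PANGEO_TOKEN"]),
)

# list_clusters() reports your running clusters; connect() re-attaches
# to one by name so it can be shut down explicitly.
for report in gateway.list_clusters():
    cluster = gateway.connect(report.name)
    cluster.shutdown()
```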
Issues 2 and 3 point to the need for something other than the cluster itself to
manage the computation. I think the possible options are JupyterHub or
perhaps Dask Gateway. Even once that’s implemented, we’d still want some basic
reporting on what jobs users have submitted and their output.
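In the meantime, one stopgap for recovering output is Dask’s published datasets, which keep a named reference alive on the scheduler after the submitting client disconnects. A sketch, assuming the `client` and `fut` from above (the dataset name is arbitrary):

```python
# In the submitting session: stash the job's future under a name
# the scheduler remembers after this client goes away.
client.publish_dataset(grid_search_result=fut)

# Later, from a fresh client connected to the same cluster:
result = client.get_dataset("grid_search_result").result()
```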
Hopefully this helps guide future development a bit. If anyone has additional
use cases, it’d be great to collect them here. And if even this hacky setup is
useful, let me know and I can make things a bit easier to use (like getting a
stable IP for the proxy address).