Pangeo Batch workflows

This has come up a few times, most recently on the Pangeo ML working group and cloud devops calls.

Motivation

Pangeo users would like a way to submit long-running tasks that aren’t necessarily tied to
the lifetime of a jupyterlab session. Examples include

  • Training a machine learning model
  • Performing a large ETL job

Demo

The introduction of Dask-Gateway gets us partway there. It (optionally)
decouples the lifetime of the Dask cluster from the lifetime of the notebook
kernel, so with some effort we can piece together a batch-style workflow.

My prototype is at https://github.com/TomAugspurger/pangeo-batch-demo. The
summary:

  1. Connect to the Gateway from outside the hub.
    import os
    from dask_gateway import Gateway, JupyterHubAuth

    auth = JupyterHubAuth(os.environ["PANGEO_TOKEN"])
    # Proxy address will be made easier to find.
    gateway = Gateway(
        address="https://staging.us-central1-b.gcp.pangeo.io/services/dask-gateway/",
        proxy_address="tls://104.197.142.28:8786",
        auth=auth
    )

This requires a JupyterHub API token, which here is stored in the PANGEO_TOKEN environment variable.

  2. Start a cluster that stays alive.
    cluster = gateway.new_cluster(shutdown_on_close=False)

This is what makes the batch-style workflow possible: the client (a Python process running on my laptop) can go away while the computation continues.

  3. Submit your job.
    from dask.distributed import fire_and_forget

    client = cluster.get_client()
    fut = client.submit(main)  # `main` is the job: a plain Python function
    fire_and_forget(fut)

We’re building on Dask here, so our job submission is just a Python function that we submit with the Dask client. We also call fire_and_forget on the resulting future so the computation runs to completion even if all the clients go away.
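To make the fire-and-forget idea concrete without a live Gateway, here's a minimal stdlib analogy (no Dask involved): work handed to an executor keeps running even after the caller drops its reference to the future. On a real Dask scheduler, dropping all references would normally cancel the task; fire_and_forget is what tells the scheduler to keep it alive, which is the part this analogy glosses over.

```python
# Minimal stdlib analogy of fire-and-forget: submit work, drop the returned
# future, and the pool finishes the job anyway.
from concurrent.futures import ThreadPoolExecutor

results = []

def main():
    # Stand-in for a long-running training or ETL job.
    results.append(sum(range(10)))

executor = ThreadPoolExecutor(max_workers=1)
executor.submit(main)          # no reference kept to the returned future
executor.shutdown(wait=True)   # the job still runs to completion
print(results)                 # [45]
```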

In my example, the job does a grid search using joblib’s Dask backend. The
training happens on the cluster.

Issues

There are plenty of issues with this approach.

  1. Users don’t have great visibility into what’s been submitted or the status
    of their jobs. The Dask dashboard is available, but that’s about it.
  2. Clusters aren’t shut down. The shutdown_on_close argument lets us disconnect
    the client, but there’s no easy way to say “shut this down when my main job
    completes”. You’d need to manually reconnect to the gateway and shut things down,
    or wait for pangeo to shut it down for you after 60 minutes of inactivity.
  3. The main job is running on a Dask worker, which can lead to awkward
    situations (e.g. trying to shut down the cluster from within one of its
    own workers is… strange).
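The missing "shut this down when my main job completes" behavior from issue 2 can at least be approximated by wrapping the job so a cleanup callback always fires. A plain-Python sketch of that shape, where `shutdown` merely stands in for a real `cluster.shutdown()` call and the names are hypothetical, not a Dask-Gateway API (and note issue 3 still applies, since the wrapper itself would run on a worker):

```python
# Sketch: run the job, then always run cleanup -- the stand-in for shutting
# down the cluster once the main job finishes.
events = []

def main():
    events.append("job ran")
    return 42

def shutdown():
    # Stand-in for cluster.shutdown() or reconnecting to the gateway.
    events.append("cluster shut down")

def run_then_shutdown(job, cleanup):
    try:
        return job()
    finally:
        cleanup()  # runs even if the job raises

result = run_then_shutdown(main, shutdown)
print(result, events)  # 42 ['job ran', 'cluster shut down']
```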

Issues 2 and 3 point to the need for something to manage the computation other
than the cluster itself. I think the possible options are either JupyterHub or
perhaps Dask Gateway. Even once that’s implemented, we’d still want some basic
reporting on what jobs users have submitted and their output.

Hopefully this helps guide future development a bit. If anyone has additional
use cases it’d be great to collect them here. If even this hacky setup is helpful
then let me know and I can make things a bit easier to use (like getting a stable
IP for the proxy address).


Tom, I just saw this post now and wish I had looked at it before opening

Thanks for sharing your thoughts.