Some usage analysis of ocean.pangeo.io

rabernat · October 2, 2019, 9:57pm

I regularly monitor the resource usage on our Google Cloud clusters, particularly ocean.pangeo.io, which my group uses for our daily science work. This hub lives in the dev-pangeo-io-cluster Kubernetes cluster. Its cost usually oscillates between $25 and $75 per day. However, in the past week I noticed a spike, with daily costs shooting up to $175!

The extra cost comes from preemptible nodes, which indicates it is driven by dask usage. This is confirmed by the fact that there are no anomalies in jupyter pod creation.

However, if I look at dask worker pod creation, I see a huge recent spike.

If I divide the number of dask worker pods per day by the number of dask clusters per day, I get the average number of workers per cluster.
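As a rough sketch of that division, assuming the daily pod and cluster counts have already been exported into two dictionaries keyed by date (the function name and the numbers below are illustrative, not real ocean.pangeo.io data):

```python
def workers_per_cluster(worker_pods_per_day, clusters_per_day):
    """Average number of dask workers per cluster for each day.

    Both arguments map a date string to a count. Days on which no
    clusters were created are skipped to avoid dividing by zero.
    """
    return {
        day: worker_pods_per_day.get(day, 0) / n_clusters
        for day, n_clusters in clusters_per_day.items()
        if n_clusters > 0
    }


# Made-up counts for illustration only.
worker_pods = {"2019-08-15": 120, "2019-09-30": 900}
clusters = {"2019-08-15": 4, "2019-09-30": 3}

ratios = workers_per_cluster(worker_pods, clusters)
for day, ratio in sorted(ratios.items()):
    print(day, ratio)
```

This is only an average: a single very large cluster on a day with several small ones will be partly hidden by it.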

Before September, this was in a normal range, with fewer than 50 workers per cluster. But now we are seeing clusters with huge numbers of workers.

This situation is not sustainable: we will run out of credits much too soon at this burn rate. Furthermore, it reveals several problems with our current configuration:

Users are able to create essentially an unlimited number of dask workers, up to the limit of the nodepool size.
We have provided no guidance to our users about best practices.
We have no way to contact our users (e.g. email list, etc.) to share such information.
We have no way to track dask statistics to a specific user (see https://github.com/dask/dask-kubernetes/issues/173), so we don’t even know who the culprit is.
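To make the first point concrete, here is a minimal sketch of a per-cluster cap that clamps whatever a user asks for. The helper name and the limit of 50 are hypothetical (the limit simply echoes the "normal" range above); nothing like this exists in our current configuration, and a check of this kind only helps if it is enforced on the server side rather than in the user's notebook.

```python
# Hypothetical limit, chosen to match the pre-September "normal" range.
MAX_WORKERS_PER_CLUSTER = 50


def clamp_worker_request(requested, limit=MAX_WORKERS_PER_CLUSTER):
    """Return the number of workers to grant for a scale request.

    Requests above the limit are reduced to the limit; negative
    requests are rejected.
    """
    if requested < 0:
        raise ValueError("number of workers must be non-negative")
    return min(requested, limit)


granted = clamp_worker_request(300)
print(f"requested 300 workers, granted {granted}")
```

On the user side, the closest existing habit is adaptive scaling with an explicit upper bound, e.g. `cluster.adapt(minimum=0, maximum=20)`, but that is guidance rather than enforcement.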

It’s clear that this is not the users’ fault. We have no systems in place to effectively monitor and limit usage, and we lack sufficient documentation and training materials to effectively inform users.

I welcome ideas on what to do now.

rabernat · October 2, 2019, 10:08pm

Another possibility we should consider is whether we can reduce the baseline cost. The core pool does not appear to be very tightly packed.

jhamman · October 3, 2019, 5:31am

@rabernat, thanks for bringing this up and for updating our analysis plots.

On the dask side, I think we should continue to push forward on dask-gateway so we can implement user controls. This puts us on a path to solve all the problems you listed above. If we move quickly, I think we can start transitioning people to gateway sometime next week.

On pod packing, we have a few options. First and probably most realistically, we should be more specific about node pool configurations and pod resource allocations. The current setup has grown in a fairly ad-hoc way. The other thing we can look into, and this mostly applies to GKE clusters, is node auto-provisioning. This would essentially let us have many node pools that are configured based on requests at the k8s level. We wouldn’t have to manage packing pods into nodes; instead, we’d let k8s do this.
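To illustrate why pod resource requests matter for packing, here is a back-of-the-envelope sketch. It assumes a node with 16 vCPUs and 104 GB of memory (the shape of an n1-highmem-16) and a hypothetical pod request of 2 CPUs and 14 GB; it also ignores the capacity that Kubernetes reserves for system components, so real numbers will be somewhat lower.

```python
def pods_per_node(node_cpu, node_mem_gb, pod_cpu, pod_mem_gb):
    """How many identical pods fit on one node.

    The scarcer resource (CPU or memory) determines the answer.
    """
    return min(int(node_cpu // pod_cpu), int(node_mem_gb // pod_mem_gb))


def utilization(node_cpu, node_mem_gb, pod_cpu, pod_mem_gb):
    """Fraction of node CPU and memory requested when the node is full."""
    n = pods_per_node(node_cpu, node_mem_gb, pod_cpu, pod_mem_gb)
    return n * pod_cpu / node_cpu, n * pod_mem_gb / node_mem_gb


# Assumed node shape and a hypothetical pod request.
n_pods = pods_per_node(16, 104, 2, 14)
cpu_used, mem_used = utilization(16, 104, 2, 14)
print(f"{n_pods} pods per node, CPU {cpu_used:.0%}, memory {mem_used:.0%}")
```

With these numbers memory is the limiting resource, so one pod's worth of CPU stays idle on every full node. Requests that divide the node shape evenly pack better.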

rabernat · October 3, 2019, 1:28pm

Another persistent cost is the jupyter pool. While it is theoretically autoscaling, it seems to just be sitting at 1 node the whole time.

This is an n1-highmem-16 node, which costs about $20 per day. We should use smaller machine types and take more advantage of autoscaling.
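For scale, a quick bit of arithmetic using only the figures quoted in this thread (about $20 per day for the node, and a usual total of $25 to $75 per day); the 30-day month is an assumption for rounding purposes:

```python
NODE_COST_PER_DAY = 20.0      # approximate, from the figure above
USUAL_DAILY_COST_LOW = 25.0   # low end of the usual daily cluster cost
USUAL_DAILY_COST_HIGH = 75.0  # high end of the usual daily cluster cost
DAYS_PER_MONTH = 30           # assumed, for a round monthly figure

monthly_node_cost = NODE_COST_PER_DAY * DAYS_PER_MONTH
share_on_quiet_day = NODE_COST_PER_DAY / USUAL_DAILY_COST_LOW
share_on_busy_day = NODE_COST_PER_DAY / USUAL_DAILY_COST_HIGH

print(f"always-on node: about ${monthly_node_cost:.0f} per month")
print(f"share of daily cost: {share_on_busy_day:.0%} to {share_on_quiet_day:.0%}")
```

So on a quiet day, this single always-on node could account for most of the bill.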

jhamman · October 4, 2019, 4:45am

FWIW, I’ve reworked binder.pangeo.io’s notebook pool to be slightly better packed and to use preemptible nodes. This should help cut down binder’s share of the costs.

rabernat · October 8, 2019, 2:11pm

So it happened again today that someone launched 300 dask workers. ocean.pangeo.io crashed. We need to do something more radical here.

TomAugspurger · October 8, 2019, 3:21pm

https://github.com/pangeo-data/pangeo-cloud-federation/pull/439 has a relatively simple fix, until we can get dask-gateway in place on all these hubs.

jhamman · October 8, 2019, 4:31pm

TomAugspurger wrote: "until we can get dask-gateway in place on all these hubs."

To this end, we need people to help out with the following PRs:

Get chartpress working on Dask-Gateway to automate chart publishing:

Adding Dask-Gateway to Pangeo Helm Chart:

rabernat · October 8, 2019, 4:33pm

Thanks for working on this Joe!

andersy005 · June 2, 2020, 8:18pm

Has anyone done another analysis of ocean.pangeo.io or some other hubs/binder hosted on AWS after the integration of dask-gateway into pangeo’s clusters? Ccing @scottyhq in case he has some numbers on hubs hosted on AWS.
