Go multi-regional with Dask (AWS)

Open-source project I’ve been heavily involved in.
Some of the cool things it does: reduce data replication and minimise data movement, integrate with Dask, run as a cloud solution with Dask across AWS Regions, and provide a Jupyter Notebook interface.

Code


Very cool. In distributed-compute-on-aws-with-cross-regional-dask/ux_notebook.ipynb at 836dfb4678b83167f41ccee311e3bcc161efac46 · aws-samples/distributed-compute-on-aws-with-cross-regional-dask · GitHub, where does dask_worker_pools come from? I couldn’t find it.

When we did GitHub - pangeo-data/multicloud-demo: Notebooks and infrastructure for Earthcube2020: Multi-Cloud workflows with Pangeo and Dask Gateway a few years back, a difficulty was that the user had to know where the data lived and was responsible for ensuring that the “right” Dask workers were used for each computation. So we had things like

era5_tp_hist_ = gcp_client.compute(tp_hist, retries=5)  # ERA5 lives on GCP, so use the GCP cluster's client
lens_hist_ = aws_client.compute(lens_hist)  # LENS lives on AWS, so use the AWS cluster's client

i.e. a region-specific client, and users are explicitly computing results with that client. It’d be fun (and extremely challenging, I think) to use dataset metadata to inform where compute resources should be created, and which ones should be used for a particular stage of the computation.
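Roughly what I have in mind, as a minimal sketch: a catalog of dataset locations drives which client gets the work. The catalog, scheduler addresses, and compute_near_data helper are all hypothetical, not from either repo.

from dask.distributed import Client

# Hypothetical catalog mapping each dataset to where its bytes live
CATALOG = {
    "era5": {"cloud": "gcp"},
    "lens": {"cloud": "aws"},
}

# One client per cloud/region; the scheduler addresses are placeholders
CLIENTS = {
    "gcp": Client("tcp://gcp-scheduler:8786"),
    "aws": Client("tcp://aws-scheduler:8786"),
}

def compute_near_data(dataset_name, obj, **kwargs):
    """Send the computation to the cluster co-located with the named dataset."""
    client = CLIENTS[CATALOG[dataset_name]["cloud"]]
    return client.compute(obj, **kwargs)

# e.g. era5_tp_hist_ = compute_near_data("era5", tp_hist, retries=5)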

Good call, Tom. We’re using this library, GitHub - gjoseph92/dask-worker-pools: Assign tasks to pools of workers in dask by gjoseph92, to selectively tell Dask which pool of workers to run on. It sounds like your previous project did a version of this by using the resource tag mechanism in Dask.
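For anyone who hasn’t used it, here’s a rough sketch of that underlying worker-resources mechanism, which this kind of pool routing builds on. The scheduler address and the pool-us-east-1 tag name are made up for illustration.

from dask.distributed import Client

# Workers in each region are started advertising a resource tag, e.g.:
#   dask worker tcp://scheduler:8786 --resources "pool-us-east-1=1"

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def read_subset(uri):
    # Stub: read data that physically lives in us-east-1
    ...

# Only workers that advertise the pool-us-east-1 resource will run this task
future = client.submit(read_subset, "s3://example-bucket/example.nc",
                       resources={"pool-us-east-1": 1})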

For the region-location piece: yes, it was an annoyance having to select the region you wanted to run in. We bypassed this by indexing metadata for each of the datasets we wanted to connect to into an OpenSearch index.
You configure these datasets before launching the stack here. You can customise this variable, workers, to include more datasets, in more regions, etc.
The CDK package sets up some sync jobs to keep the index up to date (currently a daily refresh, but I think this could be optimised). You just query the index for the data you want, and the backend figures out the region to query.
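To make that concrete, here is a hypothetical version of the lookup. The endpoint, index name, and field names are all made up; the real schema is defined in the CDK stack.

from opensearchpy import OpenSearch

# Placeholder endpoint; in practice this comes from the deployed stack's outputs
client = OpenSearch(
    hosts=[{"host": "search-example.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

def lookup_region(dataset_name):
    """Ask the metadata index which region a dataset lives in (hypothetical schema)."""
    resp = client.search(
        index="dataset-metadata",
        body={"query": {"match": {"dataset": dataset_name}}},
    )
    return resp["hits"]["hits"][0]["_source"]["region"]  # e.g. "us-west-2"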

There’s an example in this notebook in the repo. Search for the function query_nc and you’ll see how the region can be abstracted away. Turn this into a library for Jupyter and you don’t need to worry about it at all.

File: lib/SagemakerCode/get_historical_data.ipynb
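This isn’t the repo’s query_nc, just a hypothetical sketch of the shape such a wrapper can take, combining a metadata lookup with a region-pinned submission. Every name here is made up for illustration.

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def lookup_region(dataset_name):
    # Stub standing in for the OpenSearch metadata query sketched above
    return {"era5": "us-west-2"}.get(dataset_name, "us-east-1")

def query_dataset(dataset_name, read_fn, *args, **kwargs):
    """Run read_fn on workers tagged with the dataset's home region (hypothetical helper)."""
    region = lookup_region(dataset_name)
    return client.submit(read_fn, *args, resources={f"pool-{region}": 1}, **kwargs)

# e.g. future = query_dataset("era5", some_reader, "s3://bucket/key.nc")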
