An open-source project I’ve been heavily involved in.
Some cool things it does: it reduces data replication and minimises data movement, integrates with Dask, is a cloud solution that runs Dask across AWS Regions, and provides a Jupyter Notebook interface.
Very cool. In distributed-compute-on-aws-with-cross-regional-dask/ux_notebook.ipynb at 836dfb4678b83167f41ccee311e3bcc161efac46 · aws-samples/distributed-compute-on-aws-with-cross-regional-dask · GitHub, where does dask_worker_pools come from? I couldn’t find it.
When we did GitHub - pangeo-data/multicloud-demo: Notebooks and infrastructure for Earthcube2020: Multi-Cloud workflows with Pangeo and Dask Gateway a few years back, a difficulty was that the user had to know where the data lived and was responsible for ensuring that the “right” Dask workers were used for each computation. So we had things like
era5_tp_hist_ = gcp_client.compute(tp_hist, retries=5)
lens_hist_ = aws_client.compute(lens_hist)
i.e. a region-specific client, with users explicitly computing results using that client. It’d be fun (and extremely challenging, I think) to use dataset metadata to inform where compute resources should be created, and which ones should be used for a particular stage of the computation.
Good call, Tom. We’re using the library GitHub - gjoseph92/dask-worker-pools: Assign tasks to pools of workers in dask by gjoseph92 to selectively tell Dask which pool of workers to run on. It sounds like your previous project did a version of this by using the resource tag mechanism in Dask.
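For anyone who wants to see what that looks like, here is a minimal sketch based on the dask-worker-pools README; the pool names, scheduler address, and arrays are illustrative, and the library’s API surface may have shifted since:

```python
import dask.array as da
from distributed import Client
from dask_worker_pools import pool, propagate_pools  # gjoseph92/dask-worker-pools

client = Client("tcp://scheduler:8786")  # illustrative scheduler address

# Build each branch of the graph inside the pool whose workers sit next to the data.
with pool("us-east-1"):
    east = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

with pool("us-west-2"):
    west = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

combined = east.mean() + west.mean()

# propagate_pools() lets downstream tasks inherit the pool restriction of their
# inputs, so each branch runs in its own region and only the small reduced
# results move between regions.
with propagate_pools():
    result = combined.compute()
```

As far as I understand it, the library builds the pools on top of Dask’s worker-resource/annotation machinery, which is why it lines up with the resource-tag approach.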
On the region-selection piece: yes, it was an annoyance having to pick the region you wanted to run in. We bypassed this by indexing metadata for each of the datasets we want to connect to into an index in OpenSearch.
You configure these datasets before launching the stack here. You can customise the workers variable to include more datasets, in more regions, etc.
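Just to illustrate the shape of that configuration (the dataset names and fields below are made up, and the real variable in the CDK stack may be structured differently), each entry ties a dataset to the region it lives in:

```python
# Illustrative only: the actual `workers` variable in the CDK code may use
# different names and fields.
workers = [
    {"dataset": "era5-pds", "region": "us-east-1", "bucket": "s3://era5-pds"},
    {"dataset": "noaa-gfs", "region": "us-west-2", "bucket": "s3://noaa-gfs-bdp-pds"},
]
```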
The CDK package will set up sync jobs to keep the index up to date (currently a daily refresh, but I think this could be optimised). You just need to query the index for the data you want, and the backend should figure out which region to query.
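To make the “query the index” step concrete, here is a rough sketch with opensearch-py; the endpoint, index name, and document fields are assumptions rather than what the stack actually creates:

```python
from opensearchpy import OpenSearch

# Assumed endpoint, index name, and fields; the deployed index mapping may differ.
search = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

resp = search.search(
    index="dataset-metadata",                            # hypothetical index name
    body={"query": {"match": {"dataset": "era5-pds"}}},  # dataset the user asked for
)

hit = resp["hits"]["hits"][0]["_source"]
region, path = hit["region"], hit["path"]  # the backend can now target this region
```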
This notebook in the repo has an example. Search for the function query_nc and you’ll see how you can abstract the region away. Turn this into a library for Jupyter and you don’t need to worry about regions at all.
File: lib/SagemakerCode/get_historical_data.ipynb
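I haven’t reproduced the real query_nc here, but the “turn this into a library” idea could look roughly like the hypothetical helper below, which joins the metadata lookup with the worker pools so the notebook user never has to name a region; every function, index, and field in it is illustrative rather than the repo’s actual code:

```python
import s3fs
import xarray as xr
from dask_worker_pools import pool, propagate_pools

def query_dataset(name, metadata_client):
    """Hypothetical region-abstracting reader, loosely inspired by query_nc."""
    # 1. Resolve where the data lives from the OpenSearch metadata index.
    hit = metadata_client.search(
        index="dataset-metadata",                      # assumed index name
        body={"query": {"match": {"dataset": name}}},
    )["hits"]["hits"][0]["_source"]
    region, path = hit["region"], hit["path"]          # assumed field names

    # 2. Open the NetCDF lazily and pin the graph to the pool in that region,
    #    so the bytes are read by workers co-located with the bucket.
    fs = s3fs.S3FileSystem()
    with pool(region):
        return xr.open_dataset(fs.open(path), engine="h5netcdf", chunks={})

# Usage: the caller only names the dataset, never the region.
# ds = query_dataset("era5-pds", metadata_client)
# with propagate_pools():
#     ds["t2m"].mean().compute()
```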