Hi all, I’m working on a project that involves taking a global climate dataset (e.g., the CIL Global Downscaled Projections on the Microsoft Planetary Computer), calculating climate indicators, and running an ensemble analysis for a few thousand points to support climate scenario analysis.
Input: a list of lat/lon points
Output: for each point, a few climate indicators under 1-2 scenarios (e.g., 10th/50th/90th percentiles) for a few time periods
Right now, I’m struggling to find a scalable solution that works for more than a few dozen lat/lon points; even that still takes hours using Dask with the ~400 GB of worker memory available to me. Obviously, I understand that climate data service providers are doing exactly this and charging a fortune for it, but is there a way to achieve similar results without massive cloud resources?
If you’ve tackled similar problems—whether through cloud-based parallelization, optimized data pipelines, or efficient sampling methods—I’d love to hear your insights.
Your use-case has a lot of similarities to the analysis run by Oriana Chegwidden at CarbonPlan in GitHub - carbonplan/extreme-heat: data and analysis on extreme heat under a changing climate, so you may find some useful insights there.
As you may know, the chunking of the input datasets will strongly affect the memory usage and runtime of the workflow. The CIL dataset uses an intermediate chunking scheme (chunks of (365, 360, 360) in (time, lat, lon)), which is meant to work reasonably well for both time-series and spatial analysis. That intermediate scheme means more data has to be fetched for time-series analysis, which also adds time, especially if you are far from the Planetary Computer data location (West Europe IIRC). CarbonPlan’s downscaled data is oriented a bit more towards time-series analysis (chunks of (2600, 72, 144) in (time, lat, lon)). I think there may also be a version of the data on Planetary Computer with all time points in a single chunk, which would be best suited for your workflow if your computational environment is close to the data (@norlandrhagen may remember more).
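For illustration, here’s a minimal sketch of checking the on-disk chunking and doing a vectorized nearest-neighbour point selection, so Dask only touches the chunks your points actually fall in. The Zarr URL is a placeholder and the variable name `tasmax` is an assumption; substitute the signed href you get from the Planetary Computer STAC API.

```python
import numpy as np
import xarray as xr

# Placeholder URL -- substitute the signed Zarr href from the
# Planetary Computer STAC API for the CIL GDPCIR collection.
store = "https://example.blob.core.windows.net/cil-gdpcir/tasmax.zarr"

ds = xr.open_zarr(store, chunks={})  # chunks={} keeps the on-disk chunking
print(ds["tasmax"].chunks)           # chunk sizes along (time, lat, lon)

# Vectorized nearest-neighbour selection: one shared "points" dimension
# instead of a loop over lat/lon, so only the chunks containing your
# points are fetched.
lats = xr.DataArray(np.array([48.85, 40.71, -33.87]), dims="points")
lons = xr.DataArray(np.array([2.35, -74.01, 151.21]), dims="points")
subset = ds["tasmax"].sel(lat=lats, lon=lons, method="nearest")
```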
If you are using a non-cloud-optimized file format like NetCDF (e.g., the NASA NEX downscaled dataset), you can reduce memory usage and runtime by temporarily creating an ARCO (analysis-ready, cloud-optimized) copy with time-series-optimized chunking, or by virtualizing the data (e.g., How we’re making it easier to work with open climate data – CarbonPlan).
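As a rough sketch (the paths, region bounds, and chunk sizes below are placeholders, not recommendations), a temporary time-series-optimized Zarr copy could look something like this; for very large rechunks the `rechunker` package is the more memory-safe route.

```python
import xarray as xr

# Sketch only: paths, bounds, and chunk sizes are placeholders.
ds = xr.open_mfdataset("nex-gddp/*.nc", combine="by_coords", chunks={"time": 365})

# Trim to the spatial region that actually covers your points first;
# this alone can cut the data volume dramatically.
ds = ds.sel(lat=slice(20, 60), lon=slice(-130, -60))

# Rechunk so each chunk holds the full time axis for a small spatial tile,
# then write a temporary analysis-ready copy to work against.
ds = ds.chunk({"time": -1, "lat": 30, "lon": 30})
for v in ds.data_vars:
    ds[v].encoding.pop("chunks", None)  # drop stale chunk encoding from the NetCDF files
ds.to_zarr("scratch/nex-gddp-timeseries.zarr", mode="w")
```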
The developers at Coiled (and maybe Earthmover?) have been devoting time to improving performance of Dask and Xarray for this type of workflow. Patrick Höfler will be giving a showcase talk in April on these improvements, but in general it’s worthwhile using new releases of Xarray and Dask.
Hours and 400 GB of memory sounds rough.
Perhaps, if you don’t get enough feedback on your problem in this thread, we could dive into your code during one of the Pangeo community meetings where no showcase is scheduled. Working towards accessible and meaningful analyses without massive cloud resources is a good description of why those meetings exist.
In addition to Max’s excellent reply, the details of this operation really matter. What is a “time period”? Can you write a minimal example that replicates details like the input chunk sizes and a rough version of the calculation?
Hi Deepak,
I’ve been trying to follow and apply the examples given by PAVICS to access, subset, create indices and ensemble the data, but using different datasets – albeit without understanding chunking and rechunking concepts very well, to be honest.
The time periods I am trying to average over are 10-30 year periods from daily data, so I understand how that could add up.
These examples work well as written, but when trying to adapt to my use case with 100s-1000s+ of individual data points it just takes a very long time.
A minimal example would help us help you. For example, use dask.array.random to create an array with the same shape, chunk sizes, and coordinates as your input, then add a version of the calculation you’re doing.
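Something along these lines would be enough to reproduce the performance problem; the shape, chunking, threshold, and indicator here are just guesses at your setup, not your actual workflow.

```python
import dask.array as da
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily data on a 0.25-degree grid with (365, 360, 360) chunks,
# standing in for one variable of one ensemble member.
time = pd.date_range("1990-01-01", "2019-12-31", freq="D")
lat = np.arange(-89.875, 90, 0.25)
lon = np.arange(-179.875, 180, 0.25)

data = da.random.random((len(time), len(lat), len(lon)), chunks=(365, 360, 360))
tasmax = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="tasmax",
)

# Extract a handful of points, compute a rough annual indicator, then
# percentiles over a 30-year period.
pts_lat = xr.DataArray([48.85, 40.71, -33.87], dims="points")
pts_lon = xr.DataArray([2.35, -74.01, 151.21], dims="points")
series = tasmax.sel(lat=pts_lat, lon=pts_lon, method="nearest")

hot_days = (series > 0.9).resample(time="YS").sum()  # annual count above a threshold
result = (
    hot_days.sel(time=slice("1990", "2019"))
    .chunk({"time": -1})  # quantile needs a single chunk along the reduced dim
    .quantile([0.1, 0.5, 0.9], dim="time")
)
print(result.compute())
```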