Hi everyone, I’m attempting to consolidate collective wisdom on working with Cloud Optimized GeoTIFFs (COGs) on Pangeo infrastructure. The main focus is on using Xarray and Dask to efficiently run computations against collections of COGs from within the same cloud datacenter.
I think there is quite a bit to explore here, so I’ve put together a repository with several notebooks to get started. More details in the repository README:
I think there are already some helpful resources and examples in the notebooks. For example, setting GDAL environment variables when reading COGs (GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR) can improve opening speed from seconds to milliseconds! There are also examples of how to combine many 2D COGs into a 3D DataArray, and more.
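For anyone who wants to try this outside the notebooks, here is a minimal sketch of both ideas. It is not the code from the repository: `configure_gdal` and `stack_cogs` are hypothetical helper names, the chunk size is an arbitrary choice, and it assumes `rioxarray` and `xarray` are installed and that every COG is single-band and shares the same grid.

```python
import os

# The option named above: skip the directory listing GDAL otherwise performs on open.
GDAL_OPTIONS = {"GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR"}


def configure_gdal(options=GDAL_OPTIONS):
    """Export GDAL config options as environment variables for this process."""
    os.environ.update(options)
    return {key: os.environ[key] for key in options}


def stack_cogs(urls, times, chunk_size=512):
    """Open single-band COGs lazily and stack them into a (time, y, x) DataArray.

    Hypothetical helper: assumes all COGs share the same grid and projection.
    """
    if len(urls) != len(times):
        raise ValueError("need exactly one time label per COG URL")

    # Imported here so the environment can be configured before GDAL is loaded.
    import rioxarray
    import xarray as xr

    layers = [
        rioxarray.open_rasterio(url, chunks={"x": chunk_size, "y": chunk_size})
        .squeeze("band", drop=True)  # (band=1, y, x) -> (y, x)
        for url in urls
    ]
    # Concatenate along a new "time" dimension and label it.
    return xr.concat(layers, dim="time").assign_coords(time=list(times))


applied = configure_gdal()
```

Note that the environment variable only applies to the process where it is set, so on a distributed cluster it would also need to be set on the workers.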
My hope is that folks who are interested in this can leverage our current infrastructure and dig into a common dataset to address questions that often come up:
- Should we use processes or threads when working with a Dask LocalCluster?
- Do we have rules of thumb for the CPU, RAM, and nthreads options of a Dask GatewayCluster?
- How do file locks for concurrent reading and writing impact dask task graph execution?
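As a starting point for the processes-versus-threads question, here is one way the two LocalCluster layouts could be set up for a side-by-side comparison. This is only a sketch: `local_cluster_kwargs` and `start_client` are hypothetical helpers, the default of 4 CPUs is an assumption, and it requires `dask.distributed`.

```python
def local_cluster_kwargs(use_processes, n_cpus):
    """Return LocalCluster keyword arguments for one of two layouts.

    Processes: one single-threaded worker process per CPU.
    Threads: a single in-process worker with one thread per CPU.
    """
    if use_processes:
        return {"processes": True, "n_workers": n_cpus, "threads_per_worker": 1}
    return {"processes": False, "n_workers": 1, "threads_per_worker": n_cpus}


def start_client(use_processes, n_cpus=4):
    """Start a LocalCluster with the chosen layout and return a connected Client."""
    # Imported lazily so the layout helper works without dask installed.
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(**local_cluster_kwargs(use_processes, n_cpus))
    return Client(cluster)
```

Running the same COG computation once with `start_client(True)` and once with `start_client(False)`, and comparing the task stream in the Dask dashboard, should give a first indication of which layout suits a given workload.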
I’d particularly love some feedback from folks versed in interpreting Dask diagnostics for efficiency in the LocalCluster and GatewayCluster examples. Ultimately we can consolidate findings in a new Pangeo gallery or blog post!