Cloud Optimized GeoTIFFs + Pangeo best practices

Hi everyone, I’m attempting to consolidate collective wisdom on working with Cloud Optimized GeoTIFFs (COGs) on Pangeo infrastructure, mainly focusing on using Xarray and Dask to efficiently run computations against collections of COGs from within the same cloud datacenter.

I think there is quite a bit to explore here, so I’ve put together a repository with several notebooks to get started. More details are in the repository README.

There are already some helpful resources and examples in the notebooks. For instance, setting GDAL environment variables when reading COGs (e.g. GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR) can improve opening speed from seconds to milliseconds! There are also examples of combining many 2D COGs into a 3D DataArray, and more.
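A minimal sketch of that pattern, assuming rioxarray is installed and with hypothetical URLs standing in for a real COG collection:

```python
import os

import rioxarray
import xarray as xr

# Skip listing the remote "directory" when GDAL opens a COG over HTTP;
# this is the setting that cuts open times from seconds to milliseconds.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

# Hypothetical URLs standing in for a real COG collection.
urls = [
    "https://example.com/cogs/scene-2020-01-01.tif",
    "https://example.com/cogs/scene-2020-01-02.tif",
]

# Lazily open each 2D COG with Dask chunks, then stack along a new
# "time" dimension to get a single 3D DataArray.
arrays = [rioxarray.open_rasterio(url, chunks={"x": 512, "y": 512}) for url in urls]
da = xr.concat(arrays, dim="time")
```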

My hope is that folks who are interested in this can leverage our current infrastructure and dig into a common dataset to address questions that often come up:

  • Should we use processes or threads when working with a Dask LocalCluster?
  • Do we have rules of thumb for CPU, RAM, and nthreads when choosing Dask GatewayCluster options? (See the sketch after this list.)
  • How do file locks for concurrent reading and writing impact Dask task graph execution?
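For concreteness, here is a sketch of the two setups being compared. The Dask Gateway option names (worker_cores, worker_memory) are an assumption; the options actually exposed depend on how a given deployment is configured:

```python
from dask.distributed import Client, LocalCluster

# Threads share memory (and GDAL caches/file handles); processes avoid
# GIL contention at the cost of inter-worker serialization. Flip
# processes=True/False to compare the two for COG reads.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, processes=True)
client = Client(cluster)

# With Dask Gateway, per-worker CPU and RAM come from cluster options.
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_cores = 2   # option names depend on the deployment
options.worker_memory = 8  # GiB, ditto
gateway_cluster = gateway.new_cluster(options)
```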

I’d particularly love feedback from folks versed in interpreting Dask diagnostics to assess efficiency in the LocalCluster and GatewayCluster examples. Ultimately we can consolidate findings in a new Pangeo Gallery entry or blog post!


This is a great post and an amazing set of examples @scottyhq!

What do you think about sharing this via the Pangeo Twitter account to solicit feedback? There are a lot of COG / geospatial people out there who probably don’t monitor this forum…


Thanks @rabernat, good idea. I just sent out a tweet, so please do retweet for more visibility.


This is good, thanks!


Considering a couple of the lambda examples: the important part is how many milliseconds, exactly? That is, what configuration would give us the best performance if we were to go that way?
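One way to put actual numbers on that is to time repeated opens with and without the readdir setting. A sketch assuming rasterio and a hypothetical COG URL; note that /vsicurl caching can make later opens faster, so running the script once per setting is the more reliable comparison:

```python
import os
import time

import rasterio

url = "https://example.com/cogs/scene.tif"  # hypothetical COG URL

def time_open(n=5):
    """Return the mean wall-clock seconds to open the COG n times."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        with rasterio.open(url):
            pass
        samples.append(time.perf_counter() - t0)
    return sum(samples) / n

os.environ.pop("GDAL_DISABLE_READDIR_ON_OPEN", None)
print(f"default:   {time_open() * 1000:.1f} ms")

os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
print(f"EMPTY_DIR: {time_open() * 1000:.1f} ms")
```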