Cloud Optimized GeoTIFFs + Pangeo best practices

Hi everyone, I’m attempting to consolidate collective wisdom on working with Cloud Optimized GeoTIFFs (COGs) on Pangeo infrastructure, mainly focusing on using Xarray and Dask to efficiently run computations against collections of COGs from within the same cloud datacenter.

I think there is quite a bit to explore here, so I’ve put together a repository with several notebooks to get started. More details are in the repository README.

There are already some helpful resources and examples in the notebooks. For instance, setting GDAL environment variables when reading COGs (e.g. GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR) can improve opening speed from seconds to milliseconds! There are also examples of combining many 2D COGs into a 3D DataArray, and more.
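A minimal sketch of that pattern, assuming rioxarray is installed and with hypothetical URLs standing in for a real COG collection:

```python
import os

import rioxarray
import xarray as xr

# Skip listing the remote "directory" when GDAL opens a COG over HTTP;
# this is the setting that cuts open times from seconds to milliseconds.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

# Hypothetical URLs standing in for a real COG collection.
urls = [
    "https://example.com/cogs/scene-2020-01-01.tif",
    "https://example.com/cogs/scene-2020-01-02.tif",
]

# Lazily open each 2D COG with Dask chunks, then stack along a new
# "time" dimension to get a single 3D DataArray.
arrays = [rioxarray.open_rasterio(url, chunks={"x": 512, "y": 512}) for url in urls]
da = xr.concat(arrays, dim="time")
```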

My hope is that folks who are interested in this can leverage our current infrastructure and dig into a common dataset to address questions that often come up:

  • Should we use processes or threads when working with a Dask LocalCluster?
  • Do we have rules of thumb for CPU, RAM, and nthreads when choosing Dask GatewayCluster options? (See the sketch after this list.)
  • How do file locks for concurrent reading and writing impact Dask task graph execution?
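For concreteness, here is a sketch of the two setups being compared. The Dask Gateway option names (worker_cores, worker_memory) are an assumption; the options actually exposed depend on how a given deployment is configured:

```python
from dask.distributed import Client, LocalCluster

# Threads share memory (and GDAL caches/file handles); processes avoid
# GIL contention at the cost of inter-worker serialization. Flip
# processes=True/False to compare the two for COG reads.
cluster = LocalCluster(n_workers=4, threads_per_worker=2, processes=True)
client = Client(cluster)

# With Dask Gateway, per-worker CPU and RAM come from cluster options.
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_cores = 2   # option names depend on the deployment
options.worker_memory = 8  # GiB, ditto
gateway_cluster = gateway.new_cluster(options)
```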

I’d particularly love feedback from folks versed in interpreting Dask diagnostics to assess efficiency in the LocalCluster and GatewayCluster examples. Ultimately we can consolidate findings in a new Pangeo Gallery entry or blog post!


This is a great post and an amazing set of examples @scottyhq!

What do you think about sharing this via the Pangeo Twitter account to solicit feedback? There are a lot of COG / geospatial people out there who probably don’t monitor this forum…


Thanks @rabernat, good idea. I just sent out a tweet, so please do retweet for more visibility.


This is good, thanks!


Considering a couple of the lambda examples: the important part is how many milliseconds, exactly? That is, what configuration would give us the best performance if we were to go that way?
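One way to put actual numbers on that is to time repeated opens with and without the readdir setting. A sketch assuming rasterio and a hypothetical COG URL; note that /vsicurl caching can make later opens faster, so running the script once per setting is the more reliable comparison:

```python
import os
import time

import rasterio

url = "https://example.com/cogs/scene.tif"  # hypothetical COG URL

def time_open(n=5):
    """Return the mean wall-clock seconds to open the COG n times."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        with rasterio.open(url):
            pass
        samples.append(time.perf_counter() - t0)
    return sum(samples) / n

os.environ.pop("GDAL_DISABLE_READDIR_ON_OPEN", None)
print(f"default:   {time_open() * 1000:.1f} ms")

os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
print(f"EMPTY_DIR: {time_open() * 1000:.1f} ms")
```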