Writing larger-than-memory COG from many NetCDF files

Hello,

Over the last few days I have been struggling with this issue and was hoping to get some tips from the Pangeo community…

I have 3390 NetCDF files that together represent environmental data at 30 m spatial resolution over a large area (all of Canada). I would like to extract individual variables from this dataset as nation-wide COGs for visualization and further analysis.

I have a processing chain that finds all the NetCDF files I want and uses xarray’s open_mfdataset to turn them into a dataset from which I grab variables of interest as a dataarray, e.g.:

merged = xr.open_mfdataset(ncs_filtered, # List of NetCDF file paths
                           chunks = 'auto', # NetCDF files are 2001x2001
                           parallel = False, # True leads to missing/wrong location tiles in some cases
                           preprocess = remove_edge, # Remove 1 pixel overlap along 2 edges
                           engine = 'h5netcdf')

da = merged[variable]
da
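
The remove_edge preprocess just drops the 1-pixel overlap before merging; it is essentially something like this (a sketch only, the exact edges and dimension names depend on the data):

def remove_edge(ds):
    # Trim the shared 1-pixel overlap (sketch: drops the last row/column; assumes dims are named x and y)
    return ds.isel(x = slice(None, -1), y = slice(None, -1))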

For regional areas, I have simply used rio.to_raster to get my COG, e.g.,

da.rio.write_nodata(-32767, encoded = True, inplace = True)
da.rio.to_raster(os.path.join(output, 'output.tif'), **cog_kwargs) 

However, at the national scale I run out of memory (my workstation has 128 GB RAM, for reference).

I have tried options that are reported to help with writing large rasters, e.g.:

da.rio.to_raster(os.path.join(output, 'output.tif'), tiled = True, windowed = True, BIGTIFF = 'YES',  **cog_kwargs)

and I don’t run out of memory, but the process seemingly takes a very long time: I left it running for about 24 hours and it did not complete.

I was monitoring my dask client, e.g.,

from dask.distributed import LocalCluster

cluster = LocalCluster(n_workers = 10, threads_per_worker = 10, memory_limit = 0)  # memory_limit = 0 disables the per-worker memory limit
client = cluster.get_client()
client

and I could see processing ongoing in the Dask dashboard, with tasks completing every second or two. I thought each completing task was a tile being processed, but it ran for longer than I would expect for my 3390 NetCDF files (6840 chunks in the dataarray). So I don’t really have a sense of when, or if, this would finish (and I have dozens of variables I want to write to COG, so waiting multiple days per variable is not really viable).
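
Presumably each chunk turns into more than one task (e.g. an open/read task plus the preprocess step), so the dashboard count won’t map one-to-one onto files or chunks. A rough way to see what Dask is actually tracking (a sketch; attribute details may vary with dask version):

import numpy as np

# da is the lazily-loaded (dask-backed) DataArray from above
n_chunks = int(np.prod(da.data.numblocks))      # number of chunks in the array
n_tasks = len(dict(da.data.__dask_graph__()))   # number of tasks in the full graph
print(f"{n_chunks} chunks, {n_tasks} tasks, ~{da.nbytes / 1e9:.0f} GB uncompressed")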

Any tips are appreciated! I will continue doing research and trying different options.

Wrap it in a VRT so you get the windowed advantage? There are a couple of rioxarray issues like this.
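
Something along these lines, assuming the tiles are first available as GDAL-readable rasters (paths and creation options here are just placeholders):

from osgeo import gdal

# tile_paths: hypothetical list of per-tile rasters (GeoTIFFs or GDAL-readable NetCDF subdatasets)
gdal.BuildVRT('mosaic.vrt', tile_paths)

# Let GDAL do a windowed translation of the mosaic into a single COG
gdal.Translate('output_cog.tif', 'mosaic.vrt',
               format = 'COG',
               creationOptions = ['COMPRESS=DEFLATE', 'BIGTIFF=YES', 'NUM_THREADS=ALL_CPUS'])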

@ZZMitch odc.geo offers Dask-based COG creation. It supports distributed Dask clusters (assuming shared storage or an S3 destination) and larger-than-cluster-memory use cases.

Method is odc.geo.save_cog_with_dask
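
A minimal sketch of the call (assuming da is the dask-backed DataArray and already carries CRS/geobox information that odc-geo can read; I believe it returns a Dask delayed that you then compute, but argument names and return behaviour may vary between versions):

from odc.geo.cog import save_cog_with_dask
import odc.geo.xr  # registers the .odc accessor on xarray objects

# da: dask-backed DataArray with a valid CRS/transform
result = save_cog_with_dask(da, 'output.tif')  # local path on shared storage or an s3:// URL
result.compute()  # trigger the distributed write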

But as always, large-scale processing is hard, and your bottleneck could be on the source side, so it might not necessarily help in your case.

Thanks for the ideas! What ended up consistently working for me was just setting tiled = True in to_raster() on a machine with ~400 GB RAM.
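
i.e. roughly (with the same cog_kwargs as above; a sketch, not the exact cell I ran):

da.rio.write_nodata(-32767, encoded = True, inplace = True)
da.rio.to_raster(os.path.join(output, 'output.tif'),
                 tiled = True,  # write an internally tiled GeoTIFF (TILED=YES creation option)
                 **cog_kwargs)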

I was able to get it working sometimes on my ~100 GB RAM workstation by following the advice here as well: Limits of open_mfdataset for many netcdf files · pydata/xarray · Discussion #8925 · GitHub. That included setting a lock, along with tiling/windowing, to reduce RAM use. However, it would not work consistently; I hit intermittent errors.
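
For reference, the write-side lock option in rioxarray looks something like this (a sketch; on a distributed cluster you would pass a dask.distributed.Lock instead of a threading.Lock):

import threading

da.rio.to_raster(os.path.join(output, 'output.tif'),
                 tiled = True,
                 windowed = True,
                 lock = threading.Lock(),  # serialize chunked writes to the output file
                 **cog_kwargs)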