Hello,
Over the last few days I have been struggling with this issue and was hoping to get some tips from the Pangeo community…
I have 3390 NetCDF files that together represent environmental data at 30 m spatial resolution over a large area (all of Canada). I would like to extract individual variables from this dataset as nation-wide COGs for visualization and further analysis.
I have a processing chain that finds all the NetCDF files I want and uses xarray’s open_mfdataset to turn them into a dataset from which I grab variables of interest as a dataarray, e.g.:
merged = xr.open_mfdataset(ncs_filtered, # List of NetCDF file paths
chunks = 'auto', # NetCDF files are 2001x2001
parallel = False, # True leads to missing/wrong location tiles in some cases
preprocess = remove_edge, # Remove 1 pixel overlap along 2 edges
engine = 'h5netcdf')
da = merged[variable]
da
For regional areas, I have simply used rio.to_raster to get my COG, e.g.,
da.rio.write_nodata(-32767, encoded = True, inplace = True)
da.rio.to_raster(os.path.join(output, 'output.tif'), **cog_kwargs)
However, at the national-scale - I run out of memory (my workstation has 128 GB RAM, for reference).
I have tried options that are reported to help with writing large rasters, e.g.,:
da.rio.to_raster(os.path.join(output, 'output.tif'), tiled = True, windowed = True, BIGTIFF = 'YES', **cog_kwargs)
and I don’t run out of memory, but the process will seemingly take a very long time to run. I left this running for about 24 hours and it did not complete.
I was monitoring my dask client, e.g.,
from dask.distributed import LocalCluster
cluster = LocalCluster(n_workers = 10, threads_per_worker = 10, memory_limit = 0)
client = cluster.get_client()
client
and I could see processing ongoing like this
with some processing completing every second or two. I thought this was each tile being processed, but it seemed to be running for longer than it would take for my 3390 NetCDF files or 6840 chunks in the dataarray. So I don’t really have a sense for when or if this would finish (and I have dozens of variables I want to write to COG - so waiting multiple days per variable is not really viable).
Any tips are appreciated! I will continue doing research and trying different options.

