Writing larger-than-memory COG from many NetCDF files

Hello,

Over the last few days I have been struggling with this issue and was hoping to get some tips from the Pangeo community…

I have 3390 NetCDF files that together represent environmental data at 30 m spatial resolution over a large area (all of Canada). I would like to extract individual variables from this dataset as nation-wide COGs for visualization and further analysis.

I have a processing chain that finds all the NetCDF files I want and uses xarray’s open_mfdataset to turn them into a dataset from which I grab variables of interest as a dataarray, e.g.:

merged = xr.open_mfdataset(ncs_filtered, # List of NetCDF file paths
                           chunks = 'auto', # NetCDF files are 2001x2001
                           parallel = False, # True leads to missing/wrong location tiles in some cases
                           preprocess = remove_edge, # Remove 1 pixel overlap along 2 edges
                           engine = 'h5netcdf')

da = merged[variable]
da
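
The remove_edge preprocess just drops the 1-pixel overlap before merging; it is essentially something like this (a sketch only, the exact edges and dimension names depend on the data):

def remove_edge(ds):
    # Trim the shared 1-pixel overlap (sketch: drops the last row/column; assumes dims are named x and y)
    return ds.isel(x = slice(None, -1), y = slice(None, -1))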

For regional areas, I have simply used rio.to_raster to get my COG, e.g.,

da.rio.write_nodata(-32767, encoded = True, inplace = True)
da.rio.to_raster(os.path.join(output, 'output.tif'), **cog_kwargs) 

However, at the national scale I run out of memory (my workstation has 128 GB RAM, for reference).

I have tried options that are reported to help with writing large rasters, e.g.:

da.rio.to_raster(os.path.join(output, 'output.tif'), tiled = True, windowed = True, BIGTIFF = 'YES',  **cog_kwargs)

and I don’t run out of memory, but the process seemingly takes a very long time: I left it running for about 24 hours and it did not complete.

I was monitoring my dask client, e.g.,

from dask.distributed import LocalCluster

cluster = LocalCluster(n_workers = 10, threads_per_worker = 10, memory_limit = 0)  # memory_limit = 0 disables the per-worker memory limit
client = cluster.get_client()
client

and I could see processing ongoing in the Dask dashboard, with tasks completing every second or two. I thought each completing task was a tile being processed, but it ran for longer than I would expect for my 3390 NetCDF files (6840 chunks in the dataarray). So I don’t really have a sense of when, or if, this would finish (and I have dozens of variables I want to write to COG, so waiting multiple days per variable is not really viable).
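
Presumably each chunk turns into more than one task (e.g. an open/read task plus the preprocess step), so the dashboard count won’t map one-to-one onto files or chunks. A rough way to see what Dask is actually tracking (a sketch; attribute details may vary with dask version):

import numpy as np

# da is the lazily-loaded (dask-backed) DataArray from above
n_chunks = int(np.prod(da.data.numblocks))      # number of chunks in the array
n_tasks = len(dict(da.data.__dask_graph__()))   # number of tasks in the full graph
print(f"{n_chunks} chunks, {n_tasks} tasks, ~{da.nbytes / 1e9:.0f} GB uncompressed")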

Any tips are appreciated! I will continue doing research and trying different options.

Wrap it in a VRT so you get the windowed advantage? There are a couple of rioxarray issues like this.
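
Something along these lines, assuming the tiles are first available as GDAL-readable rasters (paths and creation options here are just placeholders):

from osgeo import gdal

# tile_paths: hypothetical list of per-tile rasters (GeoTIFFs or GDAL-readable NetCDF subdatasets)
gdal.BuildVRT('mosaic.vrt', tile_paths)

# Let GDAL do a windowed translation of the mosaic into a single COG
gdal.Translate('output_cog.tif', 'mosaic.vrt',
               format = 'COG',
               creationOptions = ['COMPRESS=DEFLATE', 'BIGTIFF=YES', 'NUM_THREADS=ALL_CPUS'])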

@ZZMitch odc.geo offers Dask-based COG creation. It supports distributed Dask clusters (assuming shared storage or an S3 destination) and larger-than-cluster-memory use cases.

Method is odc.geo.save_cog_with_dask
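
A minimal sketch of the call (assuming da is the dask-backed DataArray and already carries CRS/geobox information that odc-geo can read; I believe it returns a Dask delayed that you then compute, but argument names and return behaviour may vary between versions):

from odc.geo.cog import save_cog_with_dask
import odc.geo.xr  # registers the .odc accessor on xarray objects

# da: dask-backed DataArray with a valid CRS/transform
result = save_cog_with_dask(da, 'output.tif')  # local path on shared storage or an s3:// URL
result.compute()  # trigger the distributed write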

But as always, large-scale processing is hard, and your bottleneck could be on the source side, so it might not necessarily help in your case.

Thanks for the ideas! What ended up consistently working for me was just setting tiled = True in to_raster() on a machine with ~400 GB RAM.
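
i.e. roughly (with the same cog_kwargs as above; a sketch, not the exact cell I ran):

da.rio.write_nodata(-32767, encoded = True, inplace = True)
da.rio.to_raster(os.path.join(output, 'output.tif'),
                 tiled = True,  # write an internally tiled GeoTIFF (TILED=YES creation option)
                 **cog_kwargs)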

I was able to get it working sometimes on my ~100 GB RAM workstation by following the advice here as well: Limits of open_mfdataset for many netcdf files · pydata/xarray · Discussion #8925 · GitHub. That included setting a lock, along with tiling/windowing, to reduce RAM use. However, it would not work consistently; I hit intermittent errors.
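
For reference, the write-side lock option in rioxarray looks something like this (a sketch; on a distributed cluster you would pass a dask.distributed.Lock instead of a threading.Lock):

import threading

da.rio.to_raster(os.path.join(output, 'output.tif'),
                 tiled = True,
                 windowed = True,
                 lock = threading.Lock(),  # serialize chunked writes to the output file
                 **cog_kwargs)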