Memory requirements for converting a NetCDF multifile dataset to Zarr

Hi,

For more context see further down, but my question is essentially this:

If I do an xr.open_mfdataset on a bunch of NetCDF files, then some rechunking, then a ds.to_zarr, will xarray read and write one variable at a time (as a worst case, if rechunking means an entire variable has to be in memory at once)?

But that’s not how it seems to work. Could anyone take a look at my quick calculation below and tell me where I’m wrong, and maybe even what I could improve?

Thanks in advance!

I had a job running overnight that did essentially this:

import xarray as xr

with xr.open_mfdataset(netcdf_files) as ds:
    # rechunk: full sigma dims, large time chunks, small horizontal chunks
    ds = ds.chunk({'siglay_dim': 34, 'siglev_dim': 35, 'time': 24*120, 'node': 100, 'nele': 100})
    ds.to_zarr('path/to/zarr')

where I’d reserved

#SBATCH --ntasks=4 --cpus-per-task=1 --mem-per-cpu=32G

i.e. 128 GB in total for the job, which I thought was plenty, but it still got killed because it ran out of memory.
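For reference, a hedged sketch of how one might cap dask's memory to match that reservation, so workers spill or pause under memory pressure instead of the whole job being OOM-killed. The LocalCluster setup below is an assumption, not something from the original job:

from dask.distributed import Client, LocalCluster

# assumption: mirror the SLURM reservation on the dask side
cluster = LocalCluster(
    n_workers=4,            # --ntasks=4
    threads_per_worker=1,   # --cpus-per-task=1
    memory_limit='32GB',    # --mem-per-cpu=32G
)
client = Client(cluster)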

The files contain 24 time steps each and are about 9.9 GB apiece, roughly 6 TB in total.
An individual variable (e.g. temp) in each file is about 45 MB, so for 600 files we’re talking about

600 * ds.temp.nbytes / 1024**3  # nbytes of temp for a single file

= 27 GB, which I thought was smaller than the 32 GB per task, so I should be good.

<xarray.Dataset>
Dimensions:                           (nele: 28173, node: 14658,
                                       siglay_dim: 34, siglev_dim: 35,
                                       three: 3, time: 24, maxnode: 11,
                                       maxelem: 9, four: 4)
Coordinates:
    x                                 (node) float32 4.881e+05 ... 4.839e+05
    y                                 (node) float32 7.636e+06 ... 7.624e+06
    lon                               (node) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    lat                               (node) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    lonc                              (nele) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    latc                              (nele) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
  * time                              (time) float32 5.812e+04 ... 5.812e+04
    z                                 (siglay_dim, node) float32 -0.01611 ......
    zlev                              (siglev_dim, node) float32 0.0 ... -5.003
Dimensions without coordinates: nele, node, siglay_dim, siglev_dim, three,
                                maxnode, maxelem, four
Data variables: (12/87)
    nprocs                            int32 640
    partition                         (nele) int32 535 535 535 ... 214 214 214
    xc                                (nele) float32 4.881e+05 ... 4.84e+05
    yc                                (nele) float32 7.636e+06 ... 7.624e+06
    siglay_center                     (siglay_dim, nele) float32 -0.0008845 ....
    siglev_center                     (siglev_dim, nele) float32 0.0 ... -1.0
    ...                                ...
Attributes: ...
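For completeness, a quick sketch to run that back-of-the-envelope check over every variable at once, assuming all files share the layout shown above and with netcdf_files the same list as before:

import xarray as xr

# assumes every file has the same variable layout as the first one
with xr.open_dataset(netcdf_files[0]) as sample:
    n_files = len(netcdf_files)  # ~600 here
    for name, var in sample.data_vars.items():
        print(f"{name}: {n_files * var.nbytes / 1024**3:.1f} GB total")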

Could you try using rechunker for the rechunking portion of the workflow? That uses some smarter algorithms to do the rechunking while keeping memory under control.

rechunker does accept an xarray.Dataset as an input, but I’m not sure if it’s been used to do both conversion from NetCDF to Zarr and the rechunking simultaneously. If things don’t work out of the box, you might use pangeo-forge-recipes for the NetCDF → Zarr step, followed by rechunker for the rechunking.
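For the rechunking step itself, a minimal sketch: the paths, the 4GB budget, and the target_chunks layout (including the dims of temp) are assumptions, and as said I haven't verified the Dataset-input path end to end:

import xarray as xr
from rechunker import rechunk

ds = xr.open_mfdataset(netcdf_files)
plan = rechunk(
    ds[['temp']],                   # one variable, to keep the example small
    target_chunks={'temp': {'time': 24 * 120, 'siglay_dim': 34, 'node': 100}},
    max_mem='4GB',                  # per-worker memory budget
    target_store='path/to/rechunked.zarr',
    temp_store='path/to/tmp.zarr',  # scratch space for the intermediate copy
)
plan.execute()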


Thanks Tom, I will check out rechunker and Pangeo Forge. I guess I can also try to force xarray to do one variable at a time then.
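Something along these lines, maybe (a rough sketch, untested for this dataset; mode='a' appends variables into an existing store):

import xarray as xr

with xr.open_mfdataset(netcdf_files) as ds:
    ds = ds.chunk({'siglay_dim': 34, 'siglev_dim': 35,
                   'time': 24 * 120, 'node': 100, 'nele': 100})
    # write coords and metadata only, creating the store
    ds.drop_vars(list(ds.data_vars)).to_zarr('path/to/zarr', mode='w')
    for name in ds.data_vars:
        # each write should only have one variable's chunks in flight
        ds[[name]].to_zarr('path/to/zarr', mode='a')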

Meanwhile, is there a good way to make xarray or dask more verbose about which variable it is reading from which file and when?
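The closest thing I'm aware of is dask's own diagnostics rather than anything variable-level in xarray; a sketch, not a confirmed solution:

# Option 1: the distributed dashboard; task names typically embed the
# name of the variable being read
from dask.distributed import Client
client = Client()
print(client.dashboard_link)

# Option 2: a progress bar under the default (non-distributed) scheduler;
# don't combine this with the Client above
# from dask.diagnostics import ProgressBar
# with ProgressBar():
#     ds.to_zarr('path/to/zarr')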


One thing that wasn’t quite intuitive to me, but might apply in this case: calling .chunk() after opening NetCDF files doesn’t seem to override the chunking that was chosen at open time; it just adds a rechunking step on top of it.

Wondering if something like this might help?

with xr.open_mfdataset(netcdf_files,
                       chunks={'siglay_dim': 34, 'siglev_dim': 35, 'time': 24*120,
                               'node': 100, 'nele': 100}) as ds:
    ds.to_zarr('path/to/zarr')

I haven’t quite figured out how best to use rechunker when starting from a set of NetCDF files; on the other hand, it works wonderfully when going from Zarr to Zarr.
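So one hedged two-step route might be: dump the NetCDF files to Zarr with their native per-file chunks first (no rechunking, so memory stays bounded), then let rechunker do the Zarr-to-Zarr chunk change. A sketch, with the paths, the 4GB budget, and the chunk tuple all assumptions:

import xarray as xr
import zarr
from rechunker import rechunk

with xr.open_mfdataset(netcdf_files) as ds:
    ds.to_zarr('path/to/native.zarr')            # step 1: copy as-is

source = zarr.open_group('path/to/native.zarr')  # step 2: Zarr -> Zarr
plan = rechunk(
    source,
    # chunk tuple and dim order for temp assumed; the other arrays in the
    # group would need entries here too
    target_chunks={'temp': (24 * 120, 34, 100)},
    max_mem='4GB',
    target_store='path/to/final.zarr',
    temp_store='path/to/tmp2.zarr',
)
plan.execute()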