Memory requirements for converting a NetCDF multifile dataset to Zarr

Hi,

For more context see further down, but my question is essentially this:

If I do an xr.open_mfdataset on a bunch of NetCDF files, then some rechunking, then a ds.to_zarr, will xarray read and write one variable at a time (as a worst case, if rechunking means an entire variable has to be in memory at once)?

But that’s not how it seems to work. Could anyone take a look at my quick calculation below and tell me where I’m wrong, and maybe even what I could improve?

Thanks in advance!

I had a job running overnight that did essentially this:

import xarray as xr

with xr.open_mfdataset(netcdf_files) as ds:
    # rechunk: full sigma dims, large time chunks, small horizontal chunks
    ds = ds.chunk({'siglay_dim': 34, 'siglev_dim': 35, 'time': 24*120, 'node': 100, 'nele': 100})
    ds.to_zarr('path/to/zarr')

where I’d reserved

#SBATCH --ntasks=4 --cpus-per-task=1 --mem-per-cpu=32G

i.e. 128 GB in total for the job, which I thought was plenty, but it still got killed because it ran out of memory.
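For reference, a hedged sketch of how one might cap dask's memory to match that reservation, so workers spill or pause under memory pressure instead of the whole job being OOM-killed. The LocalCluster setup below is an assumption, not something from the original job:

from dask.distributed import Client, LocalCluster

# assumption: mirror the SLURM reservation on the dask side
cluster = LocalCluster(
    n_workers=4,            # --ntasks=4
    threads_per_worker=1,   # --cpus-per-task=1
    memory_limit='32GB',    # --mem-per-cpu=32G
)
client = Client(cluster)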

The files contain 24 time steps each and are about 9.9 GB apiece, roughly 6 TB in total.
An individual variable (e.g. temp) in each file is about 45 MB, so for 600 files we’re talking about

600 * ds.temp.nbytes / 1024**3  # nbytes of temp for a single file

= 27 GB, which I thought was smaller than the 32 GB per task, so I should be good.

<xarray.Dataset>
Dimensions:                           (nele: 28173, node: 14658,
                                       siglay_dim: 34, siglev_dim: 35,
                                       three: 3, time: 24, maxnode: 11,
                                       maxelem: 9, four: 4)
Coordinates:
    x                                 (node) float32 4.881e+05 ... 4.839e+05
    y                                 (node) float32 7.636e+06 ... 7.624e+06
    lon                               (node) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    lat                               (node) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    lonc                              (nele) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
    latc                              (nele) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
  * time                              (time) float32 5.812e+04 ... 5.812e+04
    z                                 (siglay_dim, node) float32 -0.01611 ......
    zlev                              (siglev_dim, node) float32 0.0 ... -5.003
Dimensions without coordinates: nele, node, siglay_dim, siglev_dim, three,
                                maxnode, maxelem, four
Data variables: (12/87)
    nprocs                            int32 640
    partition                         (nele) int32 535 535 535 ... 214 214 214
    xc                                (nele) float32 4.881e+05 ... 4.84e+05
    yc                                (nele) float32 7.636e+06 ... 7.624e+06
    siglay_center                     (siglay_dim, nele) float32 -0.0008845 ....
    siglev_center                     (siglev_dim, nele) float32 0.0 ... -1.0
    ...                                ...
Attributes: ...
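For completeness, a quick sketch to run that back-of-the-envelope check over every variable at once, assuming all files share the layout shown above and with netcdf_files the same list as before:

import xarray as xr

# assumes every file has the same variable layout as the first one
with xr.open_dataset(netcdf_files[0]) as sample:
    n_files = len(netcdf_files)  # ~600 here
    for name, var in sample.data_vars.items():
        print(f"{name}: {n_files * var.nbytes / 1024**3:.1f} GB total")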

Could you try using rechunker for the rechunking portion of the workflow? That uses some smarter algorithms to do the rechunking while keeping memory under control.

rechunker does accept an xarray.Dataset as an input, but I’m not sure if it’s been used to do both conversion from NetCDF to Zarr and the rechunking simultaneously. If things don’t work out of the box, you might use pangeo-forge-recipes for the NetCDF → Zarr step, followed by rechunker for the rechunking.
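For the rechunking step itself, a minimal sketch: the paths, the 4GB budget, and the target_chunks layout (including the dims of temp) are assumptions, and as said I haven't verified the Dataset-input path end to end:

import xarray as xr
from rechunker import rechunk

ds = xr.open_mfdataset(netcdf_files)
plan = rechunk(
    ds[['temp']],                   # one variable, to keep the example small
    target_chunks={'temp': {'time': 24 * 120, 'siglay_dim': 34, 'node': 100}},
    max_mem='4GB',                  # per-worker memory budget
    target_store='path/to/rechunked.zarr',
    temp_store='path/to/tmp.zarr',  # scratch space for the intermediate copy
)
plan.execute()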


Thanks Tom, I will check out rechunker and Pangeo Forge. I guess I can also try to force xarray to do one variable at a time then.
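Something along these lines, maybe (a rough sketch, untested for this dataset; mode='a' appends variables into an existing store):

import xarray as xr

with xr.open_mfdataset(netcdf_files) as ds:
    ds = ds.chunk({'siglay_dim': 34, 'siglev_dim': 35,
                   'time': 24 * 120, 'node': 100, 'nele': 100})
    # write coords and metadata only, creating the store
    ds.drop_vars(list(ds.data_vars)).to_zarr('path/to/zarr', mode='w')
    for name in ds.data_vars:
        # each write should only have one variable's chunks in flight
        ds[[name]].to_zarr('path/to/zarr', mode='a')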

Meanwhile, is there a good way to make xarray or dask more verbose about which variable it is reading from which file and when?
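The closest thing I'm aware of is dask's own diagnostics rather than anything variable-level in xarray; a sketch, not a confirmed solution:

# Option 1: the distributed dashboard; task names typically embed the
# name of the variable being read
from dask.distributed import Client
client = Client()
print(client.dashboard_link)

# Option 2: a progress bar under the default (non-distributed) scheduler;
# don't combine this with the Client above
# from dask.diagnostics import ProgressBar
# with ProgressBar():
#     ds.to_zarr('path/to/zarr')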


One thing that wasn’t quite intuitive to me, but might apply in this case: calling .chunk() after opening NetCDF files doesn’t seem to override the chunking that was chosen at open time; it just adds a rechunking step on top of it.

Wondering if something like this might help?

with xr.open_mfdataset(netcdf_files,
                       chunks={'siglay_dim': 34, 'siglev_dim': 35, 'time': 24*120,
                               'node': 100, 'nele': 100}) as ds:
    ds.to_zarr('path/to/zarr')

I haven’t quite figured out how best to use rechunker when starting from a set of NetCDF files; on the other hand, it works wonderfully when going from Zarr to Zarr.
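So one hedged two-step route might be: dump the NetCDF files to Zarr with their native per-file chunks first (no rechunking, so memory stays bounded), then let rechunker do the Zarr-to-Zarr chunk change. A sketch, with the paths, the 4GB budget, and the chunk tuple all assumptions:

import xarray as xr
import zarr
from rechunker import rechunk

with xr.open_mfdataset(netcdf_files) as ds:
    ds.to_zarr('path/to/native.zarr')            # step 1: copy as-is

source = zarr.open_group('path/to/native.zarr')  # step 2: Zarr -> Zarr
plan = rechunk(
    source,
    # chunk tuple and dim order for temp assumed; the other arrays in the
    # group would need entries here too
    target_chunks={'temp': (24 * 120, 34, 100)},
    max_mem='4GB',
    target_store='path/to/final.zarr',
    temp_store='path/to/tmp2.zarr',
)
plan.execute()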