.to_zarr(..., compute=False) gives zeros not NaNs

I was hoping that _FillValue would be used if no chunks have been written yet. For example,

import dask.array as da
import numpy as np
import xarray as xr

ex = xr.Dataset({"data": xr.DataArray(da.full((10, 10), fill_value=np.nan))})

ex.to_zarr("~/tmp/test.zarr", mode="w", compute=False)
rt = xr.open_zarr("~/tmp/test.zarr").compute()
assert rt["data"].isnull().all()

raises an AssertionError, whereas I was hoping the values would be NaN by default. Any idea how to control which value is loaded from empty chunks? I’d like this value to default to the nodata value or _FillValue.
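
One way to see which fill value actually ended up in the store is to read the array metadata directly. The sketch below is only a diagnostic under stated assumptions: it assumes a local directory store, that Zarr format 3 keeps array metadata in `zarr.json` and format 2 in `.zarray`, and `read_fill_value` is a hypothetical helper name, not part of xarray or zarr.

```python
import json
import os


def read_fill_value(store_path, array_name):
    """Return the raw fill_value recorded in a local Zarr array's metadata.

    Hypothetical debugging helper; not part of xarray or zarr.
    """
    array_dir = os.path.join(os.path.expanduser(store_path), array_name)
    # Assumption: format 3 uses zarr.json, format 2 uses .zarray.
    for meta_name in ("zarr.json", ".zarray"):
        meta_path = os.path.join(array_dir, meta_name)
        if os.path.exists(meta_path):
            with open(meta_path) as f:
                return json.load(f).get("fill_value")
    raise FileNotFoundError(f"no array metadata found under {array_dir}")
```

Calling `read_fill_value("~/tmp/test.zarr", "data")` after the `to_zarr` call above returns whatever was recorded; a numeric zero there would be consistent with zeros coming back for chunks that were never written.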

It looks like with zarr format 3, we can accomplish this with:

ex = xr.Dataset({"data": xr.DataArray(da.full((10, 10), fill_value=np.nan))})
ex['data'].encoding = {"fill_value": np.nan}
ex.to_zarr("~/tmp/test.zarr", mode="w", compute=False, zarr_format=3)
rt = xr.open_zarr("~/tmp/test.zarr").compute()
assert rt["data"].isnull().all()

The encoding parameters are missing from the to_zarr() call. This works on my side; give it a try:

# Dataset with datavar: 'data', fill_value=np.nan, and dtype='float32':
ds = xr.Dataset(...)

ds.to_zarr(
    store=...,
    compute=False,
    zarr_format=3,
    consolidated=False,
    write_empty_chunks=False,
    encoding={
        'data': {'dtype': np.float32, 'fill_value': np.nan}
    }
)

The question of missing value / fill value is a bit of a can of worms in Zarr. Here is my brain dump.

Current Status

  • Zarr itself formally has no concept of a “mask” or null values for an array. It only has fill_value, the value returned when an entire chunk is not found in the store. This is a part of Array metadata.

    • In Zarr V2, fill_value was optional, leading to undefined behavior when accessing such values
    • In Zarr V3, this has been corrected, making fill_value mandatory.
  • Meanwhile, Xarray implemented its own logic around missing data using the _FillValue attribute, using concepts from the NetCDF User Guide and CF conventions. Here’s a relevant quote:

    The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. This value is considered to be a special value that indicates undefined or missing data, and is returned when reading values that were not written

    Note that this _FillValue also represents uninitialized data in NetCDF, but CF adds the interpretation that it is equivalent to “undefined or missing data.” Xarray implements encoding and decoding of _FillValue attributes, turning them into NaNs in memory.

    • For Zarr V2 data, the Zarr fill_value property was hijacked to represent the CF-style _FillValue. This worked because fill_value was optional. That was no longer possible for V3 data, because fill_value became mandatory.
    • For Zarr V3 data, Xarray will now set the _FillValue attribute, as it does for NetCDF data, independently of the array fill_value. To recover the old behavior, you can specify open_zarr(..., use_zarr_fill_value_as_mask=True). (However, this behavior is currently broken; see pydata/xarray issue #10269, “`use_zarr_fill_value_as_mask=True` is ignored in `open_zarr`”.)
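
To make the distinction between the two concepts concrete, here is a toy sketch in plain Python. It does not use Zarr or Xarray; `read_chunk` and `cf_decode` are hypothetical names, the dict stands in for a chunk store, and the sentinel -9999 is an arbitrary choice for illustration.

```python
import math


def read_chunk(store, key, chunk_len, fill_value):
    """Zarr-style read: a chunk missing from the store comes back as fill_value."""
    if key in store:
        return list(store[key])
    return [fill_value] * chunk_len


def cf_decode(values, cf_fill_value):
    """CF-style decoding: every element equal to _FillValue becomes NaN."""
    return [math.nan if v == cf_fill_value else float(v) for v in values]


# Toy store: chunk "0" was written, chunk "1" never was.
store = {"0": [1, -9999, 3]}

written = read_chunk(store, "0", chunk_len=3, fill_value=0)
unwritten = read_chunk(store, "1", chunk_len=3, fill_value=0)

# The sentinel inside the written chunk is masked to NaN ...
decoded_written = cf_decode(written, cf_fill_value=-9999)
# ... but the unwritten chunk is all zeros, which CF decoding leaves alone.
decoded_unwritten = cf_decode(unwritten, cf_fill_value=-9999)
```

When the array fill_value and the CF _FillValue are set independently, as in this sketch, unwritten chunks only decode to NaN if the two happen to agree.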

I agree this is all extremely confusing. For some deep background, check out this thread.

Where to go from here

We are currently relying on a pretty fragile and poorly documented set of assumptions around how to handle missing data in Xarray / Zarr. We should overcome this by formalizing the concept of missing data / null values at the Zarr level. That way, Xarray could basically skip its own encoding / decoding of fill values and rely on Zarr to do it.

The place to start would probably be to promote _FillValue to a “registered attribute” of Zarr and implement mask encoding / decoding within Zarr itself. See https://github.com/zarr-developers/zeps/pull/67#issuecomment-3220214413 for some discussion of that.

Going further, we could attempt to copy some of the good ideas in other formats like Arrow / Parquet, which actually store a null-value mask as an independent buffer alongside the array values.
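
As a rough illustration of the separate-mask idea, the sketch below keeps a validity bitmap next to a dense value list, loosely modeled on Arrow's layout (one bit per element, least-significant bit first, 1 meaning valid). It is a plain-Python toy, not the Arrow API; `pack_validity` and `unpack_validity` are hypothetical names.

```python
def pack_validity(values):
    """Split a list containing None into dense values plus a validity bitmap."""
    bitmap = bytearray((len(values) + 7) // 8)
    dense = []
    for i, v in enumerate(values):
        if v is None:
            dense.append(0)  # placeholder slot; its content carries no meaning
        else:
            dense.append(v)
            bitmap[i // 8] |= 1 << (i % 8)  # set bit i: element i is valid
    return dense, bytes(bitmap)


def unpack_validity(dense, bitmap):
    """Rebuild the original list, restoring None wherever the bit is 0."""
    return [
        v if (bitmap[i // 8] >> (i % 8)) & 1 else None
        for i, v in enumerate(dense)
    ]


dense, bitmap = pack_validity([0, None, 7, 0, None])
roundtrip = unpack_validity(dense, bitmap)
```

Because validity lives in its own buffer, 0 can be both a real value and the placeholder for a null slot without ambiguity, so no sentinel has to be reserved in the value range.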


I have always struggled with fill_value / nodata concepts in odc.{geo,stac}. I think that’s because it tries to be two similar but different things: a sentinel value for NaN when working with integer or fixed width float types, and just a “fill value” for pixels that do not overlap with the source imagery. We had to add fairly involved dtype-dependent logic for figuring out what value to default to when it is not supplied by the user or the underlying source data.

Most of the time “fill value” is also a “sentinel for NaN”, but not always. Take 8-bit RGB visual sources as an example: an all-black pixel 0,0,0 could be a reasonable fill value for blocks that were not recorded, but it can also appear as part of the valid data, so it cannot be relied on to build a “valid pixel mask”.
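
A tiny sketch of the RGB case, using made-up pixel values in plain Python, shows how a mask inferred from the fill value goes wrong:

```python
# Toy 8-bit RGB "image" as a flat list of pixels; the fill value is black.
FILL = (0, 0, 0)

# Three recorded pixels; the middle one is a genuinely black observation.
recorded = [(12, 40, 7), (0, 0, 0), (255, 255, 255)]
# Two pixels from a block that does not overlap the source imagery.
not_recorded = [FILL, FILL]
pixels = recorded + not_recorded

# Mask inferred from the fill value alone.
inferred_valid = [p != FILL for p in pixels]
# Mask reflecting what was actually observed.
true_valid = [True] * len(recorded) + [False] * len(not_recorded)

# Indices where the inferred mask disagrees with the truth.
misclassified = [
    i for i, (a, b) in enumerate(zip(inferred_valid, true_valid)) if a != b
]
```

Here the real black pixel at index 1 is wrongly treated as missing, which is why the mask has to come from somewhere other than the fill value.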

To me it feels wrong to use anything but NaN as a fill value for floating point data, but some data sources do that, while also having NaNs in the data… In fact even defining that on float data feels wrong, as it’s part of IEEE float spec already, so one should assume that NaN might be present.

Somewhat off topic: how does zarr handle NaN values in the metadata sections, given that it’s JSON under the hood, and it ain’t valid for JSON to have NaN or inf?

Perhaps we are going off topic, but Kirill, I completely agree with your message. The meaning of “missing data” is very context-dependent, and different applications need to handle it in very different ways. For floating-point data in Zarr, it is always possible to just put NaNs directly in the data, without any special metadata fill_value, _FillValue, etc. But it’s up to the user to interpret what this means. For integer types, the lack of a NaN makes this much harder.

Good question. Zarr uses the string “NaN”; see the Zarr core specification.

In Python, it does this by calling `json.dumps(…, allow_nan=True)`. (See the Python 3.13.7 documentation for the json module.)
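
For reference, this is how Python's standard `json` module treats NaN, alongside a string-based encoding. The `encode_fill` helper is hypothetical, and the "Infinity" / "-Infinity" spellings are included on the assumption that they follow the same convention as "NaN"; this is a sketch, not zarr-python's actual code path.

```python
import json
import math

# The default encoder (allow_nan=True) emits the bare token NaN,
# which is not valid under the JSON standard.
lenient = json.dumps({"fill_value": math.nan})

# With allow_nan=False the encoder refuses to serialize NaN at all.
try:
    json.dumps({"fill_value": math.nan}, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc


def encode_fill(value):
    """Map non-finite floats to string forms so the output is strict JSON."""
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
    return value


# Encoding NaN as a string first keeps the document valid JSON.
portable = json.dumps({"fill_value": encode_fill(math.nan)}, allow_nan=False)
```

The first form only round-trips through parsers that tolerate the bare `NaN` token, while the string form is readable by any JSON parser and leaves interpretation to the reader of the metadata.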
