I was hoping that _FillValue would be used if no chunks have been written yet. For example,
import dask.array as da
import numpy as np
import xarray as xr
ex = xr.Dataset({"data": xr.DataArray(da.full((10, 10), fill_value=np.nan))})
ex.to_zarr("~/tmp/test.zarr", mode="w", compute=False)
rt = xr.open_zarr("~/tmp/test.zarr").compute()
assert rt["data"].isnull().all()
The assertion fails, whereas I was hoping the values would be NaN by default. Any idea how to control which value is loaded from empty chunks? I’d like this to default to the no-data value or _FillValue.
The question of missing value / fill value is a bit of a can of worms in Zarr. Here is my brain dump.
Current Status
Zarr itself formally has no concept of a “mask” or null values for an array. It only has fill_value, the value returned when an entire chunk is not found in the store. This is a part of Array metadata.
In Zarr V2, fill_value was optional, leading to undefined behavior when reading chunks that were never written and no fill_value was set.
In Zarr V3, this has been corrected, making fill_value mandatory.
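To make the "value returned when a chunk is not found" idea concrete, here is a minimal toy model in plain NumPy. It is not the Zarr API: the ToyChunkedArray class and its methods are hypothetical names invented for this sketch, and the store is just a dict keyed by chunk index.

```python
import numpy as np


class ToyChunkedArray:
    """Toy model (NOT the Zarr API) of a 2-D chunked array with a fill_value."""

    def __init__(self, shape, chunks, dtype, fill_value):
        self.shape = shape
        self.chunks = chunks
        self.dtype = np.dtype(dtype)
        self.fill_value = fill_value
        self.store = {}  # chunk index (i, j) -> ndarray; empty until written

    def write_chunk(self, index, data):
        self.store[index] = np.asarray(data, dtype=self.dtype).reshape(self.chunks)

    def read(self):
        # Start from fill_value everywhere, then overlay the chunks that exist.
        out = np.full(self.shape, self.fill_value, dtype=self.dtype)
        for (i, j), chunk in self.store.items():
            r, c = i * self.chunks[0], j * self.chunks[1]
            out[r:r + self.chunks[0], c:c + self.chunks[1]] = chunk
        return out


arr = ToyChunkedArray(shape=(4, 4), chunks=(2, 2), dtype="f8", fill_value=-9999.0)
before = arr.read()                       # no chunks written: all fill_value
arr.write_chunk((0, 0), np.ones((2, 2)))
after = arr.read()                        # only the top-left chunk holds real data
```

Note that nothing here says the fill value means "missing"; it is only what a reader sees where no chunk exists.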
Meanwhile, Xarray implemented its own logic around missing data using the _FillValue attribute, using concepts from the NetCDF User Guide and CF conventions. Here’s a relevant quote:
The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. This value is considered to be a special value that indicates undefined or missing data, and is returned when reading values that were not written
Note that this _FillValue also represents un-initialized data in NetCDF, but CF adds the additional interpretation that this is equivalent to “undefined or missing data.” Xarray implements encoding and decoding of _FillValue attributes, turning them into NaNs in memory.
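A simplified sketch of that encode / decode round trip, in plain NumPy. This is not Xarray's actual implementation; the two helper functions are hypothetical, and promoting to float64 is an assumption made here for simplicity.

```python
import numpy as np


def decode_fill_value(raw, fill_value):
    """Simplified CF-style decoding: replace the sentinel with NaN."""
    raw = np.asarray(raw)
    decoded = raw.astype("float64")      # integers cannot hold NaN, so promote
    decoded[raw == fill_value] = np.nan
    return decoded


def encode_fill_value(values, fill_value, dtype):
    """Inverse step: replace NaN with the sentinel, then cast to the on-disk dtype."""
    values = np.asarray(values, dtype="float64")
    return np.where(np.isnan(values), fill_value, values).astype(dtype)


on_disk = np.array([3, -9999, 7], dtype="int16")
in_memory = decode_fill_value(on_disk, fill_value=-9999)
round_trip = encode_fill_value(in_memory, fill_value=-9999, dtype="int16")
```

The point is that the sentinel only has meaning because of the attribute: the same bytes without _FillValue would decode as an ordinary -9999.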
For Zarr V2 data, the Zarr fill_value property was hijacked to represent the CF-style _FillValue. This worked because fill_value was optional. That was no longer possible for V3 data, because fill_value became mandatory.
I agree this is all extremely confusing. For some deep background, check out this thread.
Where to go from here
We are currently relying on a pretty fragile and poorly documented set of assumptions around how to handle missing data in Xarray / Zarr. We should overcome this by formalizing the concept of missing data / null values at the Zarr level. That way, Xarray could basically skip its own encoding / decoding of fill values and rely on Zarr to do it.
Going further, we could attempt to copy some of the good ideas in other formats like Arrow / Parquet, which actually store a null-value mask as an independent buffer alongside the array values.
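To illustrate the separate-mask idea, here is a small NumPy sketch. It only mimics the layout (a values buffer plus an independent validity buffer); it does not use Arrow or Parquet, and the variable names are made up for the example.

```python
import numpy as np

# Values buffer: every bit pattern is available for real data, including 0.
values = np.array([0, 12, 0, 255], dtype="uint8")

# Independent validity buffer: True means "this element holds real data".
valid = np.array([True, True, False, True])

# NumPy masked arrays use the opposite convention (True means masked out).
masked = np.ma.masked_array(values, mask=~valid)
mean_valid = masked.mean()   # computed over the three valid elements only
```

Because validity lives in its own buffer, the first 0 is real data and the second 0 is null, with no sentinel value reserved in the dtype.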
I have always struggled with the fill_value / nodata concepts in odc.{geo,stac}. I think that is because the concept tries to be two similar but different things: a sentinel value standing in for NaN when working with integer or fixed-width float types, and a plain “fill value” for pixels that do not overlap with the source imagery. We had to add fairly involved dtype-dependent logic to work out a default when neither the user nor the underlying source data supplies one.
Most of the time the “fill value” is also a “sentinel for NaN”, but not always. Take 8-bit RGB visual sources as an example: an all-black pixel (0,0,0) could be a reasonable fill value for blocks that were not recorded, but it can also appear as part of the valid data, so it cannot be relied on to build a “valid pixel mask”.
To me it feels wrong to use anything but NaN as a fill value for floating point data, but some data sources do that, while also having NaNs in the data… In fact even defining a fill value on float data feels wrong, as NaN is already part of the IEEE 754 floating-point spec, so one should assume that NaN might be present.
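A tiny NumPy sketch of the RGB case, with made-up pixel values, shows how a mask derived from the fill value misclassifies valid black pixels:

```python
import numpy as np

fill = 0
rgb = np.zeros((2, 2, 3), dtype="uint8")   # every pixel starts as fill (0, 0, 0)
rgb[0, 0] = (200, 120, 30)   # observed, clearly not black
rgb[0, 1] = (0, 0, 0)        # observed, genuinely black (e.g. deep shadow)
# Row 1 was never written and still holds the fill value.

# Ground truth for this toy example: which pixels were actually observed.
truly_observed = np.array([[True, True], [False, False]])

# Naive "valid pixel mask": valid wherever any band differs from the fill value.
mask_from_fill = (rgb != fill).any(axis=-1)

# Observed pixels that the naive mask wrongly reports as missing.
misclassified = truly_observed & ~mask_from_fill
```

Here the black pixel at (0, 1) is valid data but is indistinguishable from the unrecorded row.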
Somewhat off topic: how does zarr handle NaN values in the metadata sections, given that it’s JSON under the hood, and it ain’t valid for JSON to have NaN or inf?
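For context on the JSON question: as far as I know, the Zarr specs sidestep this by encoding these float fill values as the strings "NaN", "Infinity" and "-Infinity" in the metadata. The sketch below uses only Python's standard json module; encode_fill and decode_fill are hypothetical helpers written for illustration, not Zarr functions.

```python
import json
import math

nan = float("nan")

# Python's json module emits a bare NaN token by default, which is not valid JSON.
lenient = json.dumps({"fill_value": nan})

# In strict mode it refuses to serialize NaN at all.
try:
    json.dumps({"fill_value": nan}, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False


def encode_fill(value):
    """Map non-finite floats to strings so the document stays valid JSON."""
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
    return value


def decode_fill(value):
    """Inverse of encode_fill: turn the special strings back into floats."""
    special = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}
    return special.get(value, value) if isinstance(value, str) else value


encoded = json.dumps({"fill_value": encode_fill(nan)}, allow_nan=False)
decoded = decode_fill(json.loads(encoded)["fill_value"])
```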
Perhaps we are going off topic, but Kirill, I completely agree with your message. The meaning of “missing data” is very context-dependent, and different applications need to handle it in very different ways. For floating-point data in Zarr, it is always possible to just put NaNs directly in the data, without any special metadata (fill_value, _FillValue, etc.). But it’s up to the user to interpret what this means. For integer types, the lack of a NaN makes this much harder.
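The float / integer asymmetry is easy to see in NumPy (a minimal sketch with made-up values):

```python
import numpy as np

# Floats: NaN lives directly in the data, no metadata needed.
floats = np.array([1.5, np.nan, 3.0], dtype="float32")
missing_floats = np.isnan(floats)

# Integers: there is no NaN bit pattern, so the assignment is rejected.
ints = np.array([1, 2, 3], dtype="int16")
try:
    ints[1] = np.nan
    int_accepts_nan = True
except ValueError:
    int_accepts_nan = False
```

So for integer data some out-of-band convention (a sentinel plus metadata, or a separate mask) is unavoidable.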