.to_zarr(..., compute=False) gives zeros not NaNs

Len · September 5, 2025, 7:25pm

I was hoping that _FillValue would be used if no chunks have been written yet. For example,

import dask.array as da
import numpy as np
import xarray as xr

ex = xr.Dataset({"data": xr.DataArray(da.full((10, 10), fill_value=np.nan))})

ex.to_zarr("~/tmp/test.zarr", mode="w", compute=False)
rt = xr.open_zarr("~/tmp/test.zarr").compute()
assert rt["data"].isnull().all()

the assertion raises, whereas I was hoping the values would be NaN by default. Any idea how to control which value is loaded from empty chunks? I’d like it to default to the no-data value or _FillValue.

Len · September 5, 2025, 7:25pm

It looks like with zarr format 3, we can accomplish this with:

ex = xr.Dataset({"data": xr.DataArray(da.full((10, 10), fill_value=np.nan))})
ex['data'].encoding = {"fill_value": np.nan}
ex.to_zarr("~/tmp/test.zarr", mode="w", compute=False, zarr_format=3)
rt = xr.open_zarr("~/tmp/test.zarr").compute()
assert rt["data"].isnull().all()

sotosoul · September 6, 2025, 7:12pm

The encoding parameters are missing from the to_zarr() call. This works on my side, give it a try:

# Dataset with datavar: 'data', fill_value=np.nan, and dtype='float32':
ds = xr.Dataset(...)

ds.to_zarr(
    store=...,
    compute=False,
    zarr_format=3,
    consolidated=False,
    write_empty_chunks=False,
    encoding={
        'data': {'dtype': np.float32, 'fill_value': np.nan}
    }
)

rabernat · September 8, 2025, 12:41pm

The question of missing value / fill value is a bit of a can of worms in Zarr. Here is my brain dump.

Current Status

Zarr itself formally has no concept of a “mask” or null values for an array. It only has fill_value, the value returned when an entire chunk is not found in the store. This is a part of Array metadata.
- In Zarr V2, fill_value was optional, leading to undefined behavior when accessing such values
- In Zarr V3, this has been corrected, making fill_value mandatory.
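
To make the "value returned when an entire chunk is not found" rule concrete, here is a minimal pure-Python sketch. The `ToyChunkedArray` class is hypothetical and is not how Zarr is implemented; it only mimics the lookup rule described above.

```python
import math


class ToyChunkedArray:
    """Toy 1-D chunked array (hypothetical; NOT Zarr's implementation)."""

    def __init__(self, length, chunk_size, fill_value):
        self.length = length
        self.chunk_size = chunk_size
        self.fill_value = fill_value
        self.store = {}  # chunk index -> list of values

    def write_chunk(self, index, values):
        assert len(values) == self.chunk_size
        self.store[index] = list(values)

    def read(self):
        out = []
        n_chunks = math.ceil(self.length / self.chunk_size)
        for i in range(n_chunks):
            chunk = self.store.get(i)
            if chunk is None:
                # Chunk was never written: synthesize it from fill_value.
                chunk = [self.fill_value] * self.chunk_size
            out.extend(chunk)
        return out[: self.length]


arr = ToyChunkedArray(length=6, chunk_size=2, fill_value=0.0)
arr.write_chunk(1, [5.0, 7.0])  # chunks 0 and 2 are never written
values = arr.read()
print(values)  # [0.0, 0.0, 5.0, 7.0, 0.0, 0.0]

nan_arr = ToyChunkedArray(length=4, chunk_size=2, fill_value=float("nan"))
nan_values = nan_arr.read()  # nothing written, so every element is NaN
```

With `fill_value=0.0` an unwritten array reads back as zeros, and with `fill_value=nan` it reads back as NaN, which mirrors the behavior discussed in this thread.
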
Meanwhile, Xarray implemented its own logic around missing data using the _FillValue attribute, using concepts from the NetCDF User Guide and CF conventions. Here’s a relevant quote:

The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. This value is considered to be a special value that indicates undefined or missing data, and is returned when reading values that were not written

Note that this _FillValue also represents un-initialized data in NetCDF, but CF adds the additional interpretation that this is equivalent to “undefined or missing data.” Xarray implements encoding and decoding of _FillValue attributes, turning them into NaNs in memory.
- For Zarr V2 data, the Zarr fill_value property was hijacked to represent the CF-style _FillValue. This worked because fill_value was optional. That was no longer possible for V3 data, because fill_value became mandatory.
- For Zarr V3 data, Xarray will now set the _FillValue attribute, as it does for NetCDF data, independently of the array fill_value. To recover the old behavior, you can specify open_zarr(..., use_zarr_fill_value_as_mask=True). (However, this behavior is currently broken; see pydata/xarray issue #10269, "`use_zarr_fill_value_as_mask=True` is ignored in `open_zarr`".)
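
As a rough illustration of what CF-style `_FillValue` decoding and encoding amount to, here is a NumPy-only sketch. The helper names `decode_fill_value` and `encode_fill_value` are hypothetical, and this is not Xarray's actual code path; it only shows the sentinel-to-NaN round trip described above.

```python
import numpy as np


def decode_fill_value(raw, fill_value):
    """Return a float64 copy of `raw` with `fill_value` entries set to NaN."""
    raw = np.asarray(raw)
    decoded = raw.astype("float64")
    decoded[raw == fill_value] = np.nan
    return decoded


def encode_fill_value(decoded, fill_value, dtype):
    """Replace NaNs with `fill_value` and cast back to the on-disk dtype."""
    filled = np.where(np.isnan(decoded), fill_value, decoded)
    return filled.astype(dtype)


# Hypothetical on-disk integer data using -9999 as the missing-data sentinel.
raw = np.array([3, -9999, 12, -9999], dtype="int16")
decoded = decode_fill_value(raw, fill_value=-9999)
roundtrip = encode_fill_value(decoded, fill_value=-9999, dtype="int16")
```

Note that this masking is independent of which value a store returns for unwritten chunks; the two only coincide when the array-level fill_value happens to equal the sentinel.
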

I agree this is all extremely confusing. For some deep background, check out this thread.

github.com/pydata/xarray

Is `_FillValue` really the same as zarr's `fill_value`?

opened 04:03PM - 16 Jun 21 UTC

closed 03:57PM - 23 Oct 24 UTC

d70-t

topic-CF conventions topic-zarr

The zarr backend uses the `fill_value` of zarrs `.zarray` key as if it would be the `_FillValue` according to [CF-Conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data): https://github.com/pydata/xarray/blob/1a7b285be676d5404a4140fc86e8756de75ee7ac/xarray/backends/zarr.py#L373

I think this interpretation of the `fill_value` is wrong and creates problems. Here's why:

The [zarr v2 spec](https://zarr.readthedocs.io/en/stable/spec/v2.html#metadata) is still a little vague, but states that `fill_value` is

> A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

Accordingly this value should be used to fill all areas of a variable which are not backed by a stored chunk with this value. This is also different from what [CF conventions state](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data) (emphasis mine):

> The scalar attribute with the name `_FillValue` and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. **This value is considered to be a special value that indicates undefined or missing data**, and is returned when reading values that were not written.

The difference between the two is, that `fill_value` is **only** a background value, which just isn't stored as a chunk. But `_FillValue` is (possibly) a background value **and** is interpreted as not being valid data. In my opinion, this mix of `_FillValue` and `missing_value` could be considered a defect in the CF-Conventions, but probably that's far to late as many depend on this.

Thinking of an example, when storing a density field (i.e. water droplets forming clouds) in a zarr dataset, it might be perfectly valid to set the `fill_value` to `0` and then store only chunks in regions of the atmosphere where clouds are actually present. In that case, `0` (i.e. no drops) would be a perfectly valid value, which just isn't stored. As most parts of the atmosphere are indeed cloud-free, this may save quite a bunch of storage. Other formats (e.g. [OpenVDB](https://www.openvdb.org)) commonly use this trick.

---

The issue gets worse when looking into the upcoming [zarr v3 spec](https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html#array-metadata) where `fill_value` is described as:

> Provides an element value to use for uninitialised portions of the Zarr array.
>
> If the data type of the Zarr array is Boolean then the value must be the literal `false` or `true`. If the data type is one of the integer data types defined in this specification, then the value must be a number with no fraction or exponent part and must be within the range of the data type.
>
> For any data type, if the `fill_value` is the literal `null` then the fill value is undefined and the implementation may use any arbitrary value that is consistent with the data type as the fill value.
>
> [...]

Thus for boolean arrays, if the `fill_value` would be interpreted as a missing value indicator, only (missing, `True`) or (`False`, missing) arrays could be represented. A (`False`, `True`) array would not be possible. The issue applies similarly for integer types as well.

Also relevant:

github.com/pydata/xarray

Unwritten Zarr v3 arrays values should default to NaN

opened 11:49PM - 14 Aug 25 UTC

shoyer

bug topic-zarr

### What happened?

With Zarr v2, Xarray conflated `_FillValue` and `fill_value` (https://github.com/pydata/xarray/issues/5475), so unwritten data in a Zarr file is always decoded as NaN. In Zarr v3, this changed. Now, it appears that unwritten data in a Zarr file uses the default Zarr `fill_value=0`. In practice, this means that if Zarr metadata is written to disk (with `compute=False`) but no values are written, the data will be decoded by xarray as all zeros instead of all NaNs. This is perhaps unavoidable for integer data (there is no integer NaN), but for floats, we should default to a user controllable `fill_value` of NaN, which is clearly an invalid value.

To reproduce:

```python
import xarray
import numpy as np

ds = xarray.Dataset({'foo': ('x', np.ones(3))}).chunk()
path = '/tmp/foo5.zarr'
ds.to_zarr(path, compute=False)

# all NaN with zarr v2, all zeros with zarr v3
print(xarray.open_zarr(path).compute())
```

Possibly related: https://github.com/pydata/xarray/issues/10633, https://github.com/pydata/xarray/issues/10269

Where to go from here

We are currently relying on a pretty fragile and poorly documented set of assumptions around how to handle missing data in Xarray / Zarr. We should overcome this by formalizing the concept of missing data / null values at the Zarr level. That way, Xarray could basically skip its own encoding / decoding of fill values and rely on Zarr to do it.

The place to start would probably be to promote _FillValue to a “registered attribute” of Zarr and implement mask encoding / decoding within Zarr itself. See https://github.com/zarr-developers/zeps/pull/67#issuecomment-3220214413 for some discussion of that.

Going further, we could attempt to copy some of the good ideas in other formats like Arrow / Parquet, which actually store a null-value mask as an independent buffer alongside the array values.
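
To give a feel for the "mask as an independent buffer" idea, here is a small sketch using NumPy masked arrays as a stand-in. It is only an analogy: it uses neither Arrow nor Parquet, and the variable names are illustrative.

```python
import numpy as np

# Payload buffer: plain integers, with no value reserved as a sentinel.
values = np.array([7, 0, 0, 42], dtype="int32")
# Validity buffer stored alongside it: False marks a null element.
valid = np.array([True, True, False, False])

# NumPy's masked arrays use the opposite convention (True = masked).
masked = np.ma.masked_array(values, mask=~valid)

n_valid = int(masked.count())  # number of non-null elements
total = int(masked.sum())      # nulls are skipped in reductions
as_list = masked.tolist()      # nulls become None
print(as_list)  # [7, 0, None, None]
```

Because validity lives in its own buffer, a stored `0` can be either real data or null, and no value in the dtype's range has to be sacrificed.
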

kirill.kzb · September 9, 2025, 3:49am

I have always struggled with fill_value / nodata concepts in odc.{geo,stac}. I think it’s due to the fact that it tries to be two similar but different things: sentinel value for NaN, when working with integer or fixed width float types and just “fill value” for pixels that do not overlap with the source imagery. We had to add fairly involved dtype dependent logic for figuring out what value to default to when not supplied by the user or the underlying source data.

Most of the time “fill value” is also a “sentinel for NaN”, but not always. Take 8 bit RGB visual sources as an example, all black pixel 0,0,0 could be a reasonable fill value for blocks that were not recorded, but it can also appear as part of the valid data, so cannot be relied on to build “valid pixel mask”.
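
The RGB case can be sketched in a few lines of NumPy. The pixel values and the `recorded` ground truth below are made up for illustration; the point is only that a mask inferred from the fill value cannot tell a real black pixel from an unrecorded one.

```python
import numpy as np

fill = 0

# Four hypothetical 8-bit RGB pixels.
pixels = np.array(
    [[12, 40, 9], [0, 0, 0], [0, 0, 0], [200, 180, 90]],
    dtype="uint8",
)
# Assumed ground truth: pixel 1 is genuinely black, pixel 2 was never recorded.
recorded = np.array([True, True, False, True])

# Valid-pixel mask inferred from the fill value alone.
guessed_valid = ~(pixels == fill).all(axis=1)

# Pixels where the inferred mask disagrees with the ground truth.
misclassified = guessed_valid != recorded
print(guessed_valid.tolist())   # [True, False, False, True]
print(misclassified.tolist())   # [False, True, False, False]
```
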

To me it feels wrong to use anything but NaN as a fill value for floating point data, but some data sources do that, while also having NaNs in the data… In fact even defining that on float data feels wrong, as it’s part of IEEE float spec already, so one should assume that NaN might be present.

Somewhat off topic: how does zarr handle NaN values in the metadata sections, given that it’s JSON under the hood, and it ain’t valid for JSON to have NaN or inf?

rabernat · September 9, 2025, 1:12pm

Perhaps we are going off topic, but Kirill, I completely agree with your message. The meaning of “missing data” is very context-dependent, and different applications need to handle it in very different ways. For floating-point data in Zarr, it is always possible to just put NaNs directly in the data, without any special metadata fill_value, _FillValue, etc. But it’s up to the user to interpret what this means. For integer types, the lack of a NaN makes this much harder.

Good question. Zarr uses the string “NaN”; see the Zarr core specification.

In Python, it does this by calling `json.dumps(…, allow_nan=True)`. (See the Python documentation for the json module.)
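
For reference, here is what the standard-library `json` module does with a NaN float. This only demonstrates stdlib behavior; the dictionary key `fill_value` is used for illustration and the snippet does not call into Zarr.

```python
import json
import math

fill_value = float("nan")

# Default behaviour (allow_nan=True): emits the bare token NaN,
# which Python can read back but which is not strictly valid JSON.
lenient = json.dumps({"fill_value": fill_value})
print(lenient)  # {"fill_value": NaN}

# Strict mode refuses to serialize NaN at all.
try:
    json.dumps({"fill_value": fill_value}, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc

# Encoding the value as the string "NaN" keeps the document valid JSON;
# the reader converts it back to a float.
as_string = json.dumps({"fill_value": "NaN"})
decoded = float(json.loads(as_string)["fill_value"])
```
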
