How does xarray feel about steps in a dimension

I had naively thought that one couldn’t have duplicated or out of order steps in a (for example) time dimension, but it seems you certainly can do that.

Are there any perspectives or writings on this? It seems fair that a rectilinear coordinate should only increase, but I suppose there's a good reason to allow it sometimes?

The case that got me first was this one:

The situation I’m considering is the OISST time series, which has “preliminary” files that end up mixed in with “final” files. A grouped order-and-distinct is enough to get a nice monotonic series out of them, but it seems xarray doesn’t care and will load them all in. Is there a monotonic-strict mode?

There sure are netcdfs out there with duplicated or out of order “x” coordinates, but I always considered that those were properly broken.

I think about files in netcdf like I think about 1D track data: it doesn’t make sense if there are duplicated or out-of-order time steps. But maybe there’s a good reason to allow that in the general case of mixed or grouped dimensions?

This topic suggests to me that it’s “user-beware” at load time, and only enforced in downstream workflows: AODN_zarr.ipynb · GitHub

(I’m certainly going to normalize my file sets as I have elsewhere, but this seems like a gap at the moment.)

Xarray does not care in general. Sortedness is mostly only useful for plotting and indexing.

You can use assert ds.indexes["time"].is_monotonic_increasing to assert properties that you want, for example.
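For example, a minimal sketch of a load-time check (the filename here is hypothetical; any dataset with a "time" coordinate backed by a pandas index works the same way):

import xarray as xr

# hypothetical path, stand-in for one of your OISST files
ds = xr.open_dataset("oisst_subset.nc")

# fail fast if the time coordinate is out of order
assert ds.indexes["time"].is_monotonic_increasing, "time coordinate is not sorted"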


Ok cool, thanks! Same with duplication? Sortedness and avoiding duplication are certainly useful for validation and preventing error propagation (and xarray can’t do everything; I can see it being useful for lining up datasets for downstream use). I don’t want to make Zarrs that mirror the old mess in netcdf, so if anyone has pointers on how they avoid this, I’m interested.

I’ll do normalizing upstream.

is_monotonic_increasing doesn’t help with duplicates (as documented); this is from a different dataset:

ds.time[16044:16046].values
array(['2024-10-20T12:00:00.000000000', '2024-10-20T12:00:00.000000000'],
      dtype='datetime64[ns]')
ds.indexes["time"].is_monotonic_increasing
True

index.is_unique and index.is_monotonic_increasing will give you strict monotonicity.
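For example, a small helper along those lines (a sketch, not an xarray API; the function name is made up):

def assert_strictly_increasing(ds, dim="time"):
    # strictly increasing means sorted with no repeated labels
    idx = ds.indexes[dim]
    if not (idx.is_unique and idx.is_monotonic_increasing):
        raise ValueError(f"{dim} index is not strictly increasing")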

Pandas has (experimental) support for disallowing duplicate labels on indexes attached to a DataFrame or Series: Duplicate Labels — pandas 2.2.3 documentation. I suspect that’s not (yet) exposed for xarray DataArrays and Datasets.
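On the pandas side that looks roughly like this (experimental API, and I believe setting the flag on an object that already has duplicate labels raises straight away):

import pandas as pd

s = pd.Series([1.0, 2.0], index=["2024-10-20", "2024-10-20"])

# raises pandas.errors.DuplicateLabelError because the index has repeated labels
s.set_flags(allows_duplicate_labels=False)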


Cool, thanks! I definitely want something for that upfront.

Also, I see just now that open_mfdataset does de-duplicate and order netcdfs at read time (so it is a reasonable expectation, and I need to make sure my .concat input gets cleaned up first).

And I see .concat() probably has handlers for this … thanks! I needed a push.

I think the behaviour you’re referring to is specific to the combine='by_coords' option, FYI, because that internally checks that the indexes are monotonic.
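So something like this (hypothetical file pattern) should raise if the combined time index ends up non-monotonic:

import xarray as xr

# combine="by_coords" checks that the combined indexes are monotonic
# and raises if the preliminary/final files leave time out of order
ds = xr.open_mfdataset("oisst_*.nc", combine="by_coords")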

Xarray does have xarray.DataArray.drop_duplicates.
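A minimal usage sketch, assuming a recent xarray:

import numpy as np
import xarray as xr

times = np.array(["2024-10-20T12:00", "2024-10-20T12:00", "2024-10-21T12:00"],
                 dtype="datetime64[ns]")
da = xr.DataArray([1.0, 1.0, 2.0], dims="time", coords={"time": times})

# keep the first occurrence of each repeated time label
da_clean = da.drop_duplicates(dim="time", keep="first")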