I had naively thought that one couldn’t have duplicated or out of order steps in a (for example) time dimension, but it seems you certainly can do that.
Are there any perspectives or writings on this? It seems fair that a rectilinear coordinate should only increase, but I suppose there’s good reason to allow multiple solutions sometimes?
The case that got me first was this one:
The situation I’m considering is the OISST time series, which has “preliminary” files that end up mixed in with “final” files. Grouping, ordering, and taking distinct values is enough to get a nice monotonic series, but it seems xarray doesn’t care and you can load them all in. Is there a monotonic-strict mode?
There sure are netcdfs out there with duplicated or out of order “x” coordinates, but I always considered that those were properly broken.
I think about files in netcdf like I think about 1D track data, it doesn’t make sense if there are duplicated or out of order time steps - but maybe there’s a good reason to allow that in the general case of mixed or grouped dimensions?
This topic suggests to me that it’s “user-beware” at load time, and only enforced in workflows downstream: AODN_zarr.ipynb · GitHub
(I’m certainly going to normalize my file sets as I have elsewhere, but this seems like a gap atm)
ok cool, thanks! Same with duplication? Sortedness and avoiding duplication is certainly useful for validation and preventing error propagation (and xarray can’t do everything, I can see it being useful for lining up datasets for downstream use). I don’t want to make Zarrs that mirror the old mess in netcdf, so if anyone has pointers to how they avoid this I’m interested.
I’ll do normalizing upstream.
Monotonic increasing doesn’t help with duplicates (as documented); this is from a different dataset:
index.is_unique and index.is_monotonic_increasing will give you strict monotonicity.
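For anyone following along, here is a minimal sketch of that combined check in plain pandas. The helper name `is_strictly_increasing` and the sample dates are made up for illustration; only `is_unique` and `is_monotonic_increasing` are the real pandas properties.

```python
import pandas as pd


def is_strictly_increasing(index: pd.Index) -> bool:
    """Hypothetical helper: True only if the index is sorted and has no repeats."""
    return bool(index.is_unique and index.is_monotonic_increasing)


# Made-up time steps with one duplicated day.
with_dupes = pd.DatetimeIndex(["2024-01-01", "2024-01-02", "2024-01-02"])
clean = pd.DatetimeIndex(["2024-01-01", "2024-01-02", "2024-01-03"])

# is_monotonic_increasing is non-strict, so the duplicate alone does not fail it.
monotonic_only = bool(with_dupes.is_monotonic_increasing)
strict_dupes = is_strictly_increasing(with_dupes)
strict_clean = is_strictly_increasing(clean)
```

With an xarray object, the same properties should be reachable through the pandas index, e.g. `ds.indexes["time"]`.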
Pandas has (experimental) support for disallowing duplicate labels on indexes attached to a DataFrame or Series: Duplicate Labels — pandas 2.2.3 documentation. I suspect that’s not (yet) exposed on xarray DataArrays and Datasets.
Cool, thanks, I definitely want something for that upfront.
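A small sketch of what that experimental pandas flag does, using made-up data. This is pandas only; it is not a claim about how xarray behaves.

```python
import pandas as pd

# Made-up series with a repeated time label.
times = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"])
series = pd.Series([0.1, 0.2, 0.3], index=times)

try:
    # Asking pandas to forbid duplicate labels on an object that already has them
    # raises immediately instead of letting the duplicates through.
    series.set_flags(allows_duplicate_labels=False)
    rejected = False
except pd.errors.DuplicateLabelError:
    rejected = True
```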
Also just now I see open_mfdataset does de-duplicate and order netcdfs at read time (so it is a reasonable expectation and I need to make sure my .concat input gets cleaned up first).
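As a rough sketch of cleaning up before or around a manual concat, assuming the “final” files should win over “preliminary” ones on the same time step: the tiny datasets, the `sst` variable name, and the `make` helper below are invented for illustration and are not the real OISST layout.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make(times, value):
    """Hypothetical stand-in for one opened file: constant 'sst' along 'time'."""
    return xr.Dataset(
        {"sst": ("time", np.full(len(times), value))},
        coords={"time": pd.to_datetime(times)},
    )


prelim = make(["2024-01-03", "2024-01-02"], 0.0)  # out of order, overlaps final
final = make(["2024-01-01", "2024-01-02"], 1.0)

# xarray happily concatenates these, duplicates and all.
combined = xr.concat([prelim, final], dim="time")

# Finals were concatenated last, so keep="last" prefers them on duplicated
# time steps; then sort to get an increasing coordinate.
clean = combined.drop_duplicates("time", keep="last").sortby("time")

# Fail loudly if the result is still not strictly increasing.
index = clean.indexes["time"]
assert index.is_unique and index.is_monotonic_increasing
```

Putting the preferred files last and using `keep="last"` is just one way to express the priority; filtering the file list upstream would avoid reading the preliminary duplicates at all.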