Xarray trouble decoding NetCDF with compressed integers

Hi all, I just came across a curious case where xarray emits this warning for a compressed integer variable stored in a NetCDF4 file:

RuntimeWarning: overflow encountered in scalar absolute
vlim = max(abs(vmin - center), abs(vmax - center))

Here is the encoding as seen by xarray:
{'dtype': dtype('int16'),
'zlib': True,
'szip': False,
'zstd': False,
'bzip2': False,
'blosc': False,
'shuffle': False,
'complevel': 1,
'fletcher32': False,
'contiguous': False,
'chunksizes': (151, 79, 118),
'preferred_chunks': {'time': 151, 'lat': 79, 'lon': 118},
'original_shape': (151, 474, 944),
'missing_value': -999,
'_FillValue': -999}

Xarray doesn’t seem to be correctly decoding and applying the missing value -999. As a side note, Panoply does. Here is the ncdump as shown in Panoply:

short CDD(time=151, lat=474, lon=944);
:missing_value = -999S; // short
:_FillValue = -999S; // short
:long_name = "Consecutive Dry Days";
:units = "days";
:_ChunkSizes = 151U, 79U, 118U; // uint

xarray is loading the variable as an int64, but since it isn’t catching the missing_value correctly, the missing grid cells (i.e., ocean and lake bodies in these data) load as -9223372036854775808 instead of NaN, which throws off spatial averaging. My other compressed variables, which have a scale_factor and add_offset, are correctly decoded into floats with missing values as NaN.

Is the missing_value = -999S as a short forcing xarray to load the variable as an integer rather than a float with NaNs?

This is publicly released data, so I can’t change the source NetCDF files, but maybe I can write an xarray preprocess function to correct the typing.
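A minimal sketch of what such a preprocess function might look like, assuming the file is opened with mask_and_scale=False so that _FillValue and missing_value stay in each variable's attrs. The helper name mask_fill_values and the small in-memory demo dataset are hypothetical, not part of the data release, and this works around the symptom rather than explaining why the default decoding fails:

```python
import numpy as np
import xarray as xr

FILL_ATTRS = ("_FillValue", "missing_value")


def mask_fill_values(ds: xr.Dataset) -> xr.Dataset:
    """Return a copy of ds whose integer variables use NaN for fill cells.

    Hypothetical helper: assumes ds was opened with mask_and_scale=False,
    so the fill attributes are still present in each variable's attrs.
    """
    out = ds.copy()
    for name, var in ds.data_vars.items():
        if not np.issubdtype(var.dtype, np.integer):
            continue  # leave float (already decoded) variables alone
        fills = [np.atleast_1d(var.attrs[key]) for key in FILL_ATTRS if key in var.attrs]
        if not fills:
            continue  # no fill attributes declared on this variable
        is_fill = var.isin(np.concatenate(fills))
        # float64 holds any int16/int32 value exactly and supports NaN
        masked = var.astype("float64").where(~is_fill)
        masked.attrs = {k: v for k, v in var.attrs.items() if k not in FILL_ATTRS}
        out[name] = masked
    return out


# Tiny stand-in for the CDD variable: int16 with -999 marking water cells.
demo = xr.Dataset(
    {
        "CDD": (
            ("lat", "lon"),
            np.array([[-999, 3], [7, -999]], dtype="int16"),
            {"_FillValue": np.int16(-999), "missing_value": np.int16(-999), "units": "days"},
        )
    }
)
fixed = mask_fill_values(demo)
print(fixed.CDD.dtype, float(fixed.CDD.mean()))

# With the real file (path elided here), the call would look something like:
# ds = mask_fill_values(xr.open_dataset(sample_file, decode_times=False, mask_and_scale=False))
# or, for many files:
# ds = xr.open_mfdataset(paths, preprocess=mask_fill_values, mask_and_scale=False)
```

One caveat: mask_and_scale=False also turns off scale_factor and add_offset decoding, so files whose variables rely on those attributes would need scaling applied separately or should be opened with the defaults.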

Thanks for any suggestions.

Can you share some example data?

Sure, the data release is from here. There are many files, but the CMIP6-LOCA2_Thresholds_AllModels_grid_R3in.tar.gz file is the smallest and demonstrates the compressed integer issue.

import numpy as np
import xarray as xr

sample_file = '/Volumes/head4/published_projects/ScienceBase_Alder_2024_CMIP6-LOCA2_Thresholds/CMIP6-LOCA2_Thresholds_AllModels_grid/R3in/CMIP6-LOCA2_Thresholds_R3in_ACCESS-CM2.ssp245.r1i1p1f1_1950-2100_16thdeg_grid.nc'
ds = xr.open_dataset(sample_file, decode_times=False)

ds.R3in.isel(time=0).plot()

print(ds.R3in[0,0,0].data)

This prints -9223372036854775808, which is the value xarray produces for water bodies in these data. The missing_value is -999S, but xarray still loads the variable as int64 rather than as a float or double that can hold NaNs.

Thanks for sharing! Hopefully someone can look into this. Sounds like an issue for the Xarray issue tracker.

Perhaps one unintentional takeaway from this thread is that a .tar.gz file containing NetCDFs is not a particularly accessible or easy way to share and distribute data! :laughing: I wanted to investigate this myself, but I gave up when I saw how much friction lay between me and an actual Xarray dataset. (Lazy, I know, but that’s probably typical.)