Variable Type changing when using Xarray

Hey all -
We’ve run into a precarious situation when trying to open data sets when using xarray.
Specifically - we notice in some operations the datatype of our data changes, which impacts its precision.
We have confirmed that our raw data is in the correct format.
In this specific example, the data values change from float32 to float64, which for the values of latitude and longitude, impacts the precision, meaning that some data retrieval is not consistent with the expected behavior.
The data can be downloaded from Github here.

Here is the snippet:

import xarray as xr

data = xr.open_dataset('test_slice.nc4')
print(data)
print(data.indexes['lat'])

And here’s the output:

<xarray.Dataset>
Dimensions:       (lat: 11, lon: 1, time: 1)
Coordinates:
  * time          (time) datetime64[ns] 2018-01-30T09:00:00
  * lat           (lat) float32 10.43 10.44 10.45 10.46 ... 10.51 10.52 10.53
  * lon           (lon) float32 -68.0
Data variables:
    analysed_sst  (time, lat, lon) float32 ...
Attributes: (12/47)
    Conventions:                CF-1.5
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
    project:                    NASA Making Earth Science Data Records for Us...
    publisher_name:             GHRSST Project Office
    publisher_url:              http://www.ghrsst.org
    publisher_email:            ghrsst-po@nceo.ac.uk
    processing_level:           L4
    cdm_data_type:              grid
Float64Index([10.430000305175781,   10.4399995803833, 10.449999809265137,
              10.460000038146973, 10.470000267028809, 10.479999542236328,
              10.489999771118164,               10.5, 10.510000228881836,
              10.520000457763672, 10.529999732971191],
             dtype='float64', name='lat')

My question is - has anyone seen this behavior before, and if so - do you know if there is a temporary workaround for this kind of behavior, or what can be done to fix this? I think this ultimately needs a long-term fix from xarray, but knowing how long that might take, I’m looking for any/all ideas that might help our less technical users.
If downloading the file doesn’t work, I can provide a stripped down version via text also.
Thank you in advance!

1 Like

Are you just seeing it with the coordinates? In your example it switches to float64 when you call indexes, which passes the coordinates through pandas, and that seems to be converting for some reason. With your netcdf file if I call data.lat then it stays as float32.
Could you use data.lat in place of data.indexes['lat']?

In [33]: data.lat
Out[33]: 
<xarray.DataArray 'lat' (lat: 11)>
array([10.43, 10.44, 10.45, 10.46, 10.47, 10.48, 10.49, 10.5 , 10.51, 10.52,
       10.53], dtype=float32)
Coordinates:
  * lat      (lat) float32 10.43 10.44 10.45 10.46 ... 10.5 10.51 10.52 10.53
Attributes:
    long_name:      latitude
    standard_name:  latitude
    axis:           Y
    units:          degrees_north
    valid_min:      -90.0
    valid_max:      90.0
    comment:        none

In [34]: data.lat.values
Out[34]: 
array([10.43, 10.44, 10.45, 10.46, 10.47, 10.48, 10.49, 10.5 , 10.51,
       10.52, 10.53], dtype=float32)

2 Likes

I agree with @sdtaylor.

All of the variables in your Dataset are (lon, lat, and analyzed_sst) are float32. Accessing the indexes directly (as in data.indexes['lat']) is not a recommended usage of Xarray. These indexes are used internally to make operations like ds.sel(lat=slice(10.44, 10.52)) work. The index type (Float64Index) actually comes from Pandas (pandas.Float64Index — pandas 1.3.3 documentation). In this case, promoting a float32 type to a float64 in the context of indexing should not affect selection operations, since float32 can safely be cast to float64

Could you give a specific example of how the use of a Float64Index for indexing creates a problem for your workflow?

2 Likes

Just noting that pandas 1.4 will support float32-dtype indexes: DOC: NumericIndex by topper-123 · Pull Request #42706 · pandas-dev/pandas · GitHub, if the promotion from float32 → float64 is a problem for precision (which I don’t quite see how it would, but I might be missing something)

2 Likes

Thank you both for your replies! The above example was just a super-small snippet that came out of my chat with someone working on xarray. I definitely have a real world example where this is causing an issue, it’s just a bit bigger of a chunk.

Using the same file as above:

import xarray as xr
longitude = -68.0

data = xr.open_dataset('test_slice.nc4')

ex_1 = data['analysed_sst'].sel(lat=slice(10.47, 10.52), lon=longitude).values[0]
ex_2 = data['analysed_sst'].sel(lat=slice(10.45, 10.52), lon=longitude).values[0]
ex_3 = data['analysed_sst'].sel(lat=slice(10.449, 10.52), lon=longitude).values[0]

print(ex_1)
print(ex_2)
print(ex_3)

The expected outcome would be:
ex_1 would have 5 values (for 10.47, 10.48, 10.49, 10.50, 10.51)
ex_2 would have 7 values (for 10.45, 10.46, 10.47, 10.48, 10.49, 10.50, 10.51)
ex_3 would have 7 values (for 10.45, 10.46, 10.47, 10.48, 10.49, 10.50, 10.51)

The actual output:

[    nan     nan     nan 299.485 299.46 ]
[    nan     nan     nan     nan 299.485 299.46 ]
[    nan     nan     nan     nan     nan 299.485 299.46 ]

ex_2 ends up having 6 values, whereas ex_3 has 7. We would expect ex_2 to have 7 as well.

@clynnes tagging chris - did our issue in this area ever get answered? seems like same problem we found.

@cgentemann this is the same issue, I work with/for Chris :slight_smile:

1 Like

Can you post a link to the actual file? What we are looking for is a short, simple, copy-pasteable example anyone can run which fully reproduces the problem. Right now, based on what you reported, it’s impossible to eliminate the possibility that this is a problem with the file encoding.

Chris and I have looked at the raw files through ncdump and h5dump and not seen any issues. However there’s a good chance I’m missing something so I can definitely provide the file.
What format are you looking for? I can reproduce this using open_dataset from our OPENDAP data, as well as from actual .nc4 files downloaded.
Would a dump of the actual .nc4 file be of use?
Let me know what works best. I can also link to the whole file if the above Github hyperlink isn’t working.

Can you post a link to the actual file?

Sorry, this comment was off base. Clearly you posted the original file in the original post! Sorry for the confusion…

I believe your problem boils down to the imprecision of specifying decimals such as 10.43 in floating point data types.

Display Precision

First let’s remind ourselves that numpy doesn’t always display all digits of precision when printing arrays.

import numpy as np
a = np.array([10.430000305175781], dtype='f4')
print(a)
# -> [10.43]

By default, numpy array values are rounded to a 8 digits of precision. (This can be customized.) Now let’s increase numpy’s display precision:

np.set_printoptions(precision=15, floatmode='fixed')
print(a)
# -> [10.430000305175781]

Same array, same bits, but different precision. In general, arbitrary smooth decimals can’t be represented exactly using float32 precision.

Your Data

Now let’s look at your data. To eliminate any chance that this is related to Xarray and it’s indexes, let’s start by just opening it directly with netCDF4 and displaying the lat values as a numpy array with the same high precision:

import pooch
import xarray as xr

url = 'https://github.com/pydata/xarray/files/7179549/test_slice.nc4.zip'
fname = pooch.retrieve(
    url, 
    known_hash='73f108768357a109ea47c5683536a05d2c49a80284a959ec7624cb1bc3377090',
    processor=pooch.Unzip()
)[0]

ds = netCDF4.Dataset(fname)
lat = ds['lat'][:].data
print(lat)
[10.430000305175781 10.439999580383301 10.449999809265137
 10.460000038146973 10.470000267028809 10.479999542236328
 10.489999771118164 10.500000000000000 10.510000228881836
 10.520000457763672 10.529999732971191]

and the same for the indexers you are specifying

test_array = np.array([10.449, 10.45, 10.47, 10.52], dtype='float32')
print(test_array)
[10.449000358581543 10.449999809265137 10.470000267028809
 10.520000457763672]

Now let’s check the comparison operations, keeping in mind that xarray sel is inclusive of both start and stop values.

for t in test_array[:-1]:
    print(sum((lat >= t) & (lat <= test_array[-1])))
# -> 8 8 6

:white_check_mark: This matches your expectations, modulo the inclusion of the 10.52 point as the stop value.

Now let’s take the same exact array values and use them in Xarray’s sel method:

ds_xr = xr.open_dataset(fname)
for t in test_array[:-1]:
    print(len(ds_xr.sel(lat=slice(t, test_array[-1])).lat.values))
# -> 8 8 6

:white_check_mark: Still consistent.

Next let’s use your exact slicing syntax with python (not Numpy) floats:

test_list = [10.449, 10.45, 10.47, 10.52]
for t in test_list[:-1]:
    print(len(ds_xr.sel(lat=slice(t, test_array[-1])).lat.values))
# -> 8 7 6

:x: Now the intermediate value has changed. This is the “error” you are concerned about. What’s going on?

We can reproduce this behavior without Xarray or Pandas by promoting everything to float64.

lat_64 = lat.astype('f8')
test_array_64 = np.array([10.449, 10.45, 10.47, 10.52], dtype='float64')
for t in test_array_64[:-1]:
    print(sum((lat_64 >= t) & (lat_64 <= test_array[-1])))
# -> 8 7 6

:tada: To me the gets to the root of the behavior. The number 10.45 is interpreted differently if it is float32 vs float64. Builtin python floats are 64-bit, so IMO Xarray does the right thing by promoting everything to float64 when making comparisons with your lat values. How can it know a priori that you meant only 32-bit precision when you typed 10.45? But I can see how behavior could feel inconsistent or arbitrary. Floating point data is just weird. :upside_down_face:

Most importantly, Xarray is consistent. If you use .sel with the exact float32 values from the array itself, not 4-digit decimal numbers you type into the python interpreter, it always gets the right answer:

index_array = ds_xr.lat.values
for t in index_array[:-1]:
    print(len(ds_xr.sel(lat=slice(t, test_array[-1])).lat.values))
# -> 10 9 8 7 6 5 4 3 2 1

I hope this helps build confidence in Xarray.

It’s possible that the new Pandas float32 indexes suggested by Tom will have a different behavior here (e.g. casting your indexer first to float32). I have not investigated that.

1 Like

This is a really thorough description of the issue and possible/proper ways to avoid it. The thing is, we’re not going through this exercise for ourselves, but rather trying to behave as we expect many of our EOSDIS data users might behave. And that slice shortcut Nick showed may just look too tempting/obvious relative to the more correct, but less concise formulation. That naturalness and terseness is one of the things that a lot of people like about xarray. Indeed, the float slice shortcut may have made its way into online tutorials already. So now that we know how to avoid it, how do we raise awareness among the community to not do it, esp. the newbies (like us :slightly_smiling_face:)?

1 Like

Agree :100: about potential user confusion. Here are some direct steps we could take

  • try out the new pandas numeric-index and see if it does the “right” thing
  • If yes, make PR to xarray to use a float32 index for float32 dimension coordinates