Nodatavals attribute in geoTIFFs and xarray

Hi folks,

noob here again.

I read in geoTIFFs using xarray’s open_rasterio. I notice that the nodatavals attribute is always a tuple of length one, e.g. (nan,) or (-9999.0,). I also only deal (so far) with single band TIFFs. I have searched and, if anything, concluded that there’s not really an official standard here. So, questions I’d really appreciate people’s guidance on are:

  • Is it a tuple to deal with the general case of n-bands in the input file, and allow for different nodatavals per band?
  • Is it a tuple to deal with multiple nodataval encodings within a single band? (Is this possible? Illegal? Dumb?)
  • This flows through to nodatavals being a tuple (of length 1) in the xarray DataArray’s attribute. I’m preserving this as a tuple, but could I (should I?) drop the tuple and just retain the scalar nodataval for the attribute?
  • Is there anything special about the nodatavals attribute in xarray? (xarray methods do seem keen to drop attributes, in general, so I’m guessing this isn’t critical at all).
  • I’m hazy on this (it’s been a few months now), but some people can create geoTIFFs with no nodatavals tag/metadata. I think open_rasterio defaults to assigning the nodatavals attribute to (nan,) in that case. In the past (several months ago), I was detecting the (nan,) and correcting the data based on what the user had told me was actually used to encode nodata. Given that (nan,) can be a genuine encoding for nodata, is there a better way of detecting that the geoTIFF creator didn’t supply this tag? Note, an acceptable answer is “No, you’d just need to know and explicitly override on a file by file basis.”

This is part of my own IO routine that ingests multiple geoTIFFs and stacks them, so by the time they’re stacked I’d like to have a consistent, single, notion of “nodata” (i.e. np.nan).
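
A minimal sketch of that kind of normalisation, using plain NumPy on the array values. The helper name `to_nan` and its `override` parameter are hypothetical, and it assumes single-band data with a length-one `nodatavals` tuple:

```python
import numpy as np


def to_nan(data, nodatavals, override=None):
    """Return a float copy of a single-band array with nodata replaced by NaN.

    nodatavals: the 1-tuple from the DataArray attrs, e.g. (-9999.0,)
    override:   the value the data provider says is really used, if the tag is wrong
    """
    nodata = nodatavals[0] if override is None else override
    out = np.asarray(data, dtype="float64").copy()
    if not np.isnan(nodata):  # NaN-encoded data is already in the target form
        out[out == nodata] = np.nan
    return out


# Two hypothetical tiles: one tagged -9999, one tagged (nan,) but really using 0.
tile_a = to_nan([[1.0, -9999.0]], (-9999.0,))
tile_b = to_nan([[0.0, 5.0]], (np.nan,), override=0)
stacked = np.stack([tile_a, tile_b])
print(stacked.shape)  # (2, 1, 2)
```

The explicit `override` reflects the "you'd just need to know" case: nothing in the array itself says that 0 means nodata.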

Thanks,
Guy

I don’t have any answers to your questions, @Guy_Maskall, but I’m watching this thread for the answers! As best I recall, different xarray extensions seem to handle/use (or not) the nodatavals attribute differently, so I ended up just doing some of my own nodataval handling to stack raster data.

Following up on this. The pangeo community has a reputation for being friendly and helpful, and whilst I’m not using pangeo as “a thing”, I do appreciate the community. I understand it spans the gamut of tools such as xarray and dask and data formats such as geoTIFF. My question above spans the geoTIFF specification (which, from my research, doesn’t seem to really mandate a format for nodata metadata?) and how xarray handles this attribute.

So, the crux of this follow-up: whilst people might not have the specific answer here, could anyone suggest resources that document geoTIFF, or point me to some dark, undiscovered corner of xarray (open_rasterio) that explains how it handles this, and/or how such metadata is handled for single-band and multi-band geoTIFFs? Any general pointers welcome. I’m just not sure where to target because I feel it spans a number of relevant concepts.

Thanks.

@Guy_Maskall thanks for the questions. A lot of work has gone into the rioxarray library with the intention of eventually replacing xarray’s experimental open_rasterio() function. As you’ve observed, GeoTIFFs can have multiple bands with different nodatavals, thus the tuple in xarray attributes. A detailed reference on the format and various options is here: GTiff – GeoTIFF File Format (GDAL documentation). It’s unfortunately also very common to come across GeoTIFFs where the nodata value is not specified in the metadata, so you have to set it yourself in xarray (for example, by masking those values to np.nan) after opening a file. I’d recommend having a look at Nodata Management (rioxarray 0.3.1 documentation). Best case scenario, if the nodata value is correctly specified in the GeoTIFF metadata, you can do:

```python
import rioxarray

da = rioxarray.open_rasterio('image.tif', masked=True)
```
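
When the tag is missing or wrong, the same masking can be done by hand after opening. Here is a rough sketch in plain xarray of what a masked read amounts to, assuming you already know the real nodata value. `mask_nodata` is a hypothetical helper, and recording the original value under `encoding["_FillValue"]` follows the convention described in the rioxarray nodata docs rather than anything xarray itself requires:

```python
import numpy as np
import xarray as xr


def mask_nodata(da, nodata):
    """Replace `nodata` with NaN and remember the original value in encoding."""
    masked = da.astype("float64").where(da != nodata)
    masked.encoding["_FillValue"] = nodata
    return masked


# Hypothetical single-band raster that uses 0 for nodata but carries no tag.
raw = xr.DataArray(
    np.array([[[0, 3], [7, 0]]], dtype="int16"),
    dims=("band", "y", "x"),
)
clean = mask_nodata(raw, 0)
print(int(clean.isnull().sum()))  # 2
```

Casting to float first matters for integer rasters, since an integer dtype cannot hold NaN.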

This is a most helpful response. Thanks @scottyhq .

But the first link says “Note that all bands must use the same nodata value.” So I’m still not sure about the tuple. We pretty much only ever deal with single band data, but I wanted to be able to extend my function to cope with multi-band gTIFFs. If they were multi-band, would they have to have the same nodata value, and so the nodataval attribute would be a tuple of length > 1, but just all the same value? For now it’s not really an issue. I mostly wanted to try to get a handle on when my approach of taking the first value in the tuple would break. :wink: And it looks like using rioxarray.open_rasterio is the way to go going forwards. Once I’ve completed writing unit tests for my current code, of course!
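
A defensive version of "take the first value" would check that every band agrees before collapsing the tuple to a scalar. This is only a sketch: `single_nodata` is a hypothetical helper, and it assumes that a multi-band file read through some other driver might report differing values even if the GTiff driver never does:

```python
def _same(a, b):
    """Equality that also treats NaN as equal to NaN."""
    return a == b or (a != a and b != b)


def single_nodata(nodatavals):
    """Collapse a per-band nodata tuple to one scalar, or raise if bands disagree."""
    if not nodatavals:
        raise ValueError("empty nodatavals tuple")
    first = nodatavals[0]
    for band, value in enumerate(nodatavals[1:], start=2):
        if not _same(first, value):
            raise ValueError(
                f"band {band} uses nodata {value!r}, but band 1 uses {first!r}"
            )
    return first


print(single_nodata((-9999.0,)))          # -9999.0
print(single_nodata((-9999.0, -9999.0)))  # -9999.0
```

With this in place, a tuple such as `(0.0, -9999.0)` fails loudly instead of silently masking only the first band's encoding.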

I’ll stick with xarray’s function for now. I suspect that function defaults to a nodata value of nan if it’s not specified in the file metadata(?) As you say, it’s not uncommon and I had this with some internal data, and I’m pretty sure the DataArray nodataval attribute was present in the DataArray, but set to (np.nan,), despite the files using 0. This was originally why I got interested in this question because I wanted to read in multiple files (that use different nodata value encodings) so wrote a function to make them all consistent. When I come back to expanding functionality here, I’ll switch to rioxarray’s implementation and worry more deeply about dealing with tuples and multi-band gTIFFs then.
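
One way to check that suspicion is to look at what rasterio itself reports, before any default is filled in. A minimal sketch, assuming rasterio is installed and that its `nodatavals` property gives `None` for a band with no nodata tag (as far as I can tell, that `None` is what ends up as `nan` in the DataArray attribute). Both helper names are hypothetical:

```python
def read_raw_nodatavals(path):
    """Return the per-band nodata tuple exactly as rasterio reports it."""
    import rasterio  # imported lazily so the pure helper below works without it

    with rasterio.open(path) as src:
        return tuple(src.nodatavals)


def missing_nodata_bands(raw_nodatavals):
    """Return 1-based band numbers whose nodata tag is absent (None)."""
    return [i for i, value in enumerate(raw_nodatavals, start=1) if value is None]


# A file whose creator really declared NaN would report (nan,), not (None,),
# so the two cases can be told apart at this level.
print(missing_nodata_bands((None,)))          # [1]
print(missing_nodata_bands((float("nan"),)))  # []
print(missing_nodata_bands((-9999.0, None)))  # [2]
```

This only tells you the tag is absent. The actual value used in the file would still have to come from whoever produced it.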
