Fixing GHRSST - seeking opinions

The GHRSST NetCDF product (Multi-scale_Ultra-high_Resolution_MUR-SST) has a number of problems:

  • units are Kelvin, which defeats lazy-load strategies: the recalculation to Celsius has to be pushed all the way out to the user experience
  • the degenerate rectilinear storage of the lon,lat arrays in the NetCDF is better expressed as an extent (a bbox, or if you like a 4-number geotransform: an offset and scale that imply the bbox when combined with the dimensions ncol,nrow), and this extent is not cleanly derivable from the values in the NetCDF
  • the coordinate arrays are 32-bit float, so the grid appears to be slightly irregular, but this is obviously an artefact of the precision rather than the intent
  • they are pretty big arrays - 36000x17999 Int16 (there are also Byte mask arrays and a sea-ice-fraction array) - and there are no overviews (precomputed pyramids for fast reads of large areas at lower resolution)
  • the offset and scale of the Int16 data values convert them to Kelvin (edit: we might just convert to float, though)
  • there’s no real use of the 3rd dimension (time) in the NetCDF; it’s just a time stamp for the array
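
A quick way to see the float32 problem from the bullet above: a perfectly regular 0.01-degree latitude axis, once stored as float32, no longer has constant steps. The values below are a reconstruction of a MUR-like axis for illustration, not read from an actual file:

```python
import numpy as np

# reconstruct a regular 0.01-degree latitude axis (17999 centres, as in MUR)
lat64 = -89.99 + 0.01 * np.arange(17999)   # float64, effectively regular
lat32 = lat64.astype(np.float32)           # what the NetCDF stores

# the float32 round-trip makes the spacing wobble around 0.01
steps = np.diff(lat32.astype(np.float64))
print(steps.min(), steps.max())            # not exactly 0.01 everywhere
```

The wobble is on the order of a float32 ulp near 90 (about 1e-5 degrees), which is exactly the kind of "irregularity" that makes tools refuse to treat the grid as regular.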

My solution is this:

  • write as COGs with overviews using average resampling
  • use the extent (the extreme edges of area-pixels) xmin,xmax,ymin,ymax = -180,180,-89.995,89.995 (this appears to be the intention; one missing half-row at each of the north and south margins)
  • write the same Int16 values with ZSTD or DEFLATE compression, but override the offset/scale metadata so they read out as Celsius
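
To make the grid-spec and unit claims concrete, here is the arithmetic those bullets rely on, using the edges from the vrt:// string further down (±89.995) and the native packing (scale 0.001, offset 298.15 K); the raw value is a hypothetical example:

```python
# assumed intended extent and dimensions of the MUR grid
xmin, xmax, ymin, ymax = -180.0, 180.0, -89.995, 89.995
ncol, nrow = 36000, 17999

# the bbox as an offset/scale grid spec (GDAL-style geotransform)
xres = (xmax - xmin) / ncol
yres = (ymax - ymin) / nrow
geotransform = (xmin, xres, 0.0, ymax, 0.0, -yres)

# same Int16 payload, Celsius on read: 298.15 K == 25 C, so only the
# offset in the metadata changes and the scale (0.001) stays
raw = 3000                        # hypothetical stored Int16 value
kelvin = 0.001 * raw + 298.15     # native unpacking
celsius = 0.001 * raw + 25.0      # with the overridden offset
assert abs((kelvin - 273.15) - celsius) < 1e-9
```

Both resolutions come out at 0.01 degrees, so the whole grid spec collapses to four numbers plus the dimensions, and the data bytes never need to be touched.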

I’m a very bad Python coder, but I’m getting OK with it. The important snippet for the extent and scaling fix is coded here:

The code is just a draft and I’m still testing it locally - it’s the grid-spec logic I’m interested in discussing, not the deployment details.

The vrt:// protocol syntax there requires GDAL >= 3.7.

There’s no problem doing this in either R or Python, but we’ll see how that goes. Python has a real GDAL API and R does not (though it can be done with the partial pieces available), so it seems sensible to frame this in Python terms here - the per-file performance would be exactly the same either way, since GDAL does the actual work. I originally drafted this with the warper API to resample to the full global extent at 36000x18000, but that seems hard to justify, and I think this reframing of the extent is probably correct.
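
One sanity check on that reframing (a sketch under the assumed extent, nothing read from a file): a 17999-row grid with edges at ±89.995 implies pixel centres running from 89.99 down to -89.99, which is exactly what the NetCDF coordinate arrays are trying to say:

```python
import numpy as np

ymin, ymax, nrow = -89.995, 89.995, 17999
yres = (ymax - ymin) / nrow

# north-to-south pixel centres implied by the extent
centres = ymax - yres * (np.arange(nrow) + 0.5)
print(centres[0], centres[-1])   # 89.99 ... -89.99
```

So assigning the extent reproduces the existing centres exactly, whereas warping to 36000x18000 would actually resample the data.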

Alternative workflows with odc/xarray don’t seem as straightforward to me but that might just be my discomfort with the tools themselves.

This problem has plagued me and colleagues for years, and now it seems trivial to create a copy and set up a workflow to maintain it. There are 7000+ days since June 2002, so it’s ~2.5 TB in total now - not huge in today’s context, but enough to prevent an individual doing it for themselves.
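
A back-of-envelope for the volume, counting only the uncompressed Int16 SST band (the ~2.5 TB figure above then implies a few-to-one compression, whatever exactly it covers):

```python
ncol, nrow, days = 36000, 17999, 7000
bytes_per_day = ncol * nrow * 2          # Int16 analysed_sst band, uncompressed
total_tb = bytes_per_day * days / 1e12
print(round(total_tb, 1))                # ~9.1 TB uncompressed
```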

I’m interested in opinions, or anything glaring I’m missing. I’ve searched a long time for any real discussion about this dataset and its issues and haven’t found anything. It’s extremely hard to use in its current form, and a COG set on cloud storage would be fantastic.

Do you think the top edge should be aligned to 90? Or do you read the grid differently to my take?


What workflows are you trying to execute with this data in the end?

General access, point and trajectory extraction, and array summaries - we use SST routinely for animal tracking, ecosystem modelling, and comparison with shipboard measures.

I personally don’t do analyses much anymore … I’m just supporting science users and collaborators at our institution. Our current tooling works fine, but this dataset could very easily be made faster and publicly accessible in a cloud COG form.

I have this to the point where any of the GHRSST URLs can be templated like this to generate the COG with what I think is the right interpretation, and it reads out as Celsius without having to convert or unscale the data from its native integer type.

dsn = f"vrt://NetCDF:\"/vsicurl/{input_path}\":analysed_sst?a_ullr=-180,89.995,180,-89.995&a_offset=25&a_scale=0.001&a_srs=EPSG:4326"

Where input_path looks like

You need to have the GDAL_HTTP_HEADERS config set to an Authorization header built from the EARTHDATA_TOKEN, in a form like this:

header = f"Authorization: Bearer {os.environ['EARTHDATA_TOKEN']}"

Then any GDAL tool (since 3.8.0) can read/stream/convert from it as if it were a nice geospatial raster - R, Python, Julia, QGIS, C++, the CLI … In later GDAL releases the subdataset (variable) choice syntax can be simplified, but I’ve left it at what the current release supports.
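
As one concrete example of "any GDAL tool", here is a sketch of the conversion step itself. The creation options are plausible COG-driver choices rather than a tested recipe, and the DSN is the templated string from above with a placeholder in place of a real granule URL:

```python
# placeholder DSN - substitute a real granule URL for <granule-url>
dsn = ('vrt://NetCDF:"/vsicurl/<granule-url>":analysed_sst'
       '?a_ullr=-180,89.995,180,-89.995&a_offset=25&a_scale=0.001&a_srs=EPSG:4326')

# gdal_translate with the COG driver: overviews are built automatically,
# averaged, and the band offset/scale metadata carries through
cmd = [
    "gdal_translate", "-of", "COG",
    "-co", "COMPRESS=ZSTD",
    "-co", "OVERVIEW_RESAMPLING=AVERAGE",
    dsn, "analysed_sst_cog.tif",
]
print(" ".join(cmd))
```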

This is a small step forward for us. I’m hoping to come back and ask the questions above in more detail, to find out what others would do when faced with this dataset and its grid-specification ambiguities, before we commit to an actual conversion. :pray: