The GHRSST NetCDF product (Multi-scale Ultra-high Resolution, MUR-SST) has a number of problems:
- units are Kelvin, which defeats lazy-load strategies: the need to rescale gets pushed all the way up to the end user
- the degenerate rectilinear storage of lon,lat arrays in NetCDF is better expressed as an extent (a bbox, or if you like a 4-parameter geotransform - offset,scale, an implicit bbox when combined with the dimensions ncol,nrow), and this is not cleanly derivable from the values in the NetCDF
- the coord arrays are 32-bit float, so they appear slightly irregular, but that is clearly a precision artefact rather than real irregularity
- the arrays are pretty big, 36000x17999 Int16 (there are also Byte mask arrays and a sea ice fraction array), and there are no overviews (precomputed pyramids for fast reads of large areas at lower resolution)
- the offset and scale of the Int16 data values convert them to Kelvin (edit: we might just convert to float, though)
- there’s no real use of the 3rd dimension (time) in the NetCDF; it’s just a time stamp for the array
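To illustrate the float32 coordinate point, a small sketch (the axis values here are constructed for illustration, not read from a real file):

```python
import numpy as np

# A nominally regular 0.01-degree axis of cell centres, stored as float32
# the way the MUR-SST lon/lat arrays are (constructed values, illustrative only).
ncol = 36000
centres64 = -179.995 + 0.01 * np.arange(ncol)   # regular spacing in float64
centres32 = centres64.astype(np.float32)        # what the file stores

diffs = np.diff(centres32.astype(np.float64))
print(diffs.min(), diffs.max())                 # spacing no longer looks constant

# Recover the regular grid from the endpoints and the count,
# not from the per-element float32 values.
step = (float(centres32[-1]) - float(centres32[0])) / (ncol - 1)
```

The apparent irregularity is just float32 rounding near +/-180, which is why the extent plus dimensions is the trustworthy description of the grid.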
My solution is this:
- write as COGs with overviews using average resampling
- use the extent (the extreme edges of area pixels) xmin,xmax,ymin,ymax = -180,180,-89.9905,89.9905 (this appears to be the intention, with one missing half-row at each of the north and south margins)
- write the same Int16 values with ZSTD or DEFLATE compression, but override the offset/scale metadata so they read out as Celsius
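The grid spec in those bullets reduces to simple arithmetic; a sketch (the 0.001/298.15 scale/offset are what I believe the source files carry, so treat those as assumptions):

```python
# Geotransform from the chosen extent (pixel edges, not centres) and array shape.
xmin, xmax, ymin, ymax = -180.0, 180.0, -89.9905, 89.9905
ncol, nrow = 36000, 17999

xres = (xmax - xmin) / ncol        # 0.01 exactly
yres = (ymax - ymin) / nrow        # ~0.01, not exact with these y edges
gt = (xmin, xres, 0.0, ymax, 0.0, -yres)   # GDAL geotransform ordering

# Kelvin -> Celsius without touching the Int16 values: shift the offset only.
scale_k, offset_k = 0.001, 298.15  # assumed NetCDF scale_factor/add_offset
offset_c = offset_k - 273.15       # 25.0: the same Int16 now reads as Celsius
```

One thing this makes visible: with ymax at 89.9905 the y pixel size is not exactly 0.01; edges at +/-89.995 would make it exact for 17999 rows, which bears on the top-edge question at the end of this post.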
I’m a very bad Python coder but I’m getting OK with it; the important snippet for the extent and scaling fix is here:
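A minimal sketch of that fix (assuming the subdataset name `analysed_sst`, source scale/offset of 0.001/298.15, and GDAL >= 3.7 for the `a_ullr`/`a_scale`/`a_offset` options on the `vrt://` protocol):

```python
def mur_cog_dsn(nc_path: str) -> str:
    """Build a vrt:// DSN that fixes the extent and rescales Kelvin to Celsius.

    Assumes the MUR subdataset is 'analysed_sst' with scale_factor=0.001 and
    add_offset=298.15, so an offset override of 25.0 reads out as Celsius.
    Requires GDAL >= 3.7 for these vrt:// options.
    """
    return (
        f'vrt://NETCDF:"{nc_path}":analysed_sst'
        "?a_ullr=-180,89.9905,180,-89.9905"  # ulx,uly,lrx,lry pixel edges
        "&a_scale=0.001&a_offset=25.0"       # override so values read as Celsius
    )


def write_cog(nc_path: str, out_tif: str) -> None:
    # Local import so the DSN helper above is usable without GDAL installed.
    from osgeo import gdal

    gdal.UseExceptions()
    gdal.Translate(
        out_tif,
        mur_cog_dsn(nc_path),
        format="COG",
        creationOptions=["COMPRESS=ZSTD", "OVERVIEW_RESAMPLING=AVERAGE"],
    )
```

This is a sketch under those assumptions, not a finished tool; the point is that the whole grid-spec fix collapses into one connection string.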
The code is just a draft and I’m still testing it locally - it’s the grid-spec logic I’m interested in discussing, not the deployment details.
The vrt:// protocol syntax there requires GDAL >= 3.7.
There’s no problem doing this in either R or Python, but we’ll see how that goes. Python has a real GDAL API and R does not (though it can be done with the partial pieces available), so it seems sensible to frame this in Python terms here; per-file performance would be exactly the same either way, since GDAL does the actual work. I originally drafted this with the warper API, subsampling to the full global extent at 36000x18000, but that seems hard to justify, and I think this reframing of the extent is probably correct.
Alternative workflows with odc/xarray don’t seem as straightforward to me, but that might just be my discomfort with those tools.
This problem has plagued me and colleagues for years, and now it seems trivial to create a copy and set up a workflow to maintain it. There are 7000+ days since June 2002, so it’s ~2.5TB in total now - not huge in today’s context, but enough to prevent an individual doing it for themselves.
I’m interested in opinions, or anything glaring I’m missing. I’ve searched a long time for any real discussion of this dataset and its issues and haven’t found anything. It’s extremely hard to use in its current form, and a COG set on cloud storage would be fantastic.
Do you think the top edge should be aligned to 90? Or do you see the extent differently from my take?