The GHRSST NetCDF product (Multi-scale Ultra-high Resolution, MUR-SST) has a number of problems:
units are Kelvin, which defeats lazy-load strategies: the need to recalculate gets pushed all the way out to the user experience
the degenerate rectilinear storage of lon,lat arrays in NetCDF is better expressed as an extent (a bbox, or if you like a 4-parameter geotransform: offset and scale, an implicit bbox when combined with the dimensions ncol,nrow), and this is not cleanly derivable from the values in the NetCDF
the coord arrays are 32-bit float, so they appear to be slightly irregular, but this is obviously a mistake (see the inspection sketch after this list)
they are pretty big arrays, 36000x17999 Int16 (there are also Byte array masks and a sea ice fraction), and there's no overviews (precomputed pyramids for fast reads of large areas at lower resolution)
the offset and scale of the Int16 data values convert them to Kelvin (edit: we might just convert to float, though)
there's no real use of the 3rd dimension (time) in the NetCDF; it's just a time stamp for the array
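For concreteness, here is a small inspection sketch of what I mean (not part of my pipeline; the filename is illustrative, and the packing values are as I read them from the product metadata), using the netCDF4 package:

```python
# Inspect one MUR granule: the float32 coords look irregular, and the data
# values are packed Int16 in Kelvin.
import numpy as np
import netCDF4

ds = netCDF4.Dataset("20230422090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc")

lon = ds["lon"][:]              # float32, nominally 0.01-degree spacing
print(np.unique(np.diff(lon)))  # several near-0.01 step values, not one

sst = ds["analysed_sst"]
sst.set_auto_maskandscale(False)  # look at the raw packed values
print(sst.dtype, sst.scale_factor, sst.add_offset)  # int16, 0.001, 298.15
```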
My solution is this:
write as COGs with overviews using average resampling
use the extent (the extreme edges of area-pixels) xmin,xmax,ymin,ymax = -180,180,-89.9905,89.9905 (this appears to be the intention; one missing half-row at the north and south margins)
write the same values as Int16 with ZSTD or DEFLATE compression, but override the offset/scale metadata so they read out as Celsius
I'm a very bad Python coder but I'm getting ok with it; the important snippet for the extent and scaling fix is here:
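This is just a minimal sketch of the idea (assumes GDAL >= 3.7 Python bindings and a local granule; the filenames are illustrative, and the Celsius trick relies on the native packing of scale_factor=0.001, add_offset=298.15 Kelvin):

```python
# Wrap the analysed_sst subdataset in a vrt:// string that asserts the grid
# spec and re-bases the offset so the Int16 values read out as Celsius, then
# write a COG with average-resampled overviews.
from osgeo import gdal

gdal.UseExceptions()

nc = "20230422090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
sub = f'NETCDF:"{nc}":analysed_sst'

vrt = (
    f"vrt://{sub}?a_srs=EPSG:4326"
    "&a_ullr=-180,89.9905,180,-89.9905"  # asserted cell-edge extent, as above
    "&a_scale=0.001&a_offset=25.0"       # 298.15 K - 273.15 = 25.0 C offset
)

gdal.Translate(
    "mur-analysed_sst-20230422.tif",
    vrt,
    format="COG",
    creationOptions=["COMPRESS=ZSTD", "OVERVIEW_RESAMPLING=AVERAGE"],
)
```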
The code is just a draft and I'm still testing it locally - it's the grid-spec logic I'm interested in discussing, not the deployment details.
The vrt:// protocol syntax there requires GDAL >= 3.7.
There's no problem doing this in either R or Python, but we'll see how that goes - Python has a real GDAL API and R does not, even though it can be done with the partial pieces available. It seems sensible to frame this in Python terms here, but the performance per file would be exactly the same, with GDAL doing the actual work. I originally drafted this with the warper API to resample to the full global 36000x18000 extent, but that seems hard to justify, and I think this reframing of the extent is probably correct.
Alternative workflows with odc/xarray don't seem as straightforward to me, but that might just be my discomfort with the tools themselves.
This problem has plagued me and colleagues for years, and now it seems trivial to create a copy and set up a workflow to maintain it (there are 7000+ days since June 2002, so it's ~2.5 TB in total now) - not huge in today's context, but enough to prevent an individual doing it for themselves.
I'm interested in opinions, or anything glaring I'm missing. I've searched a long time for any real discussion of this dataset and its issues and haven't found anything. It's extremely hard to use in its current form, and a COG set on cloud storage would be fantastic.
Do you think the top edge should be aligned to 90, or would you read the grid differently to my take?
General access, point and trajectory extraction, array summaries - we use SST routinely for animal tracking, ecosystem modelling, and comparison with shipboard measures.
I personally don't do analyses much anymore … I'm just supporting science users and collaborators at our institution. Our current tooling works fine, but this dataset would very easily be faster and publicly accessible in a cloud COG form.
I have this to the point where any of the GHRSST URLs can be templated like this to generate the COG that I think has the right interpretation, and it reads out as Celsius without having to convert or unscale the data from its native integer type.
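Roughly like this (a sketch, not the exact template: the option names need GDAL >= 3.7, the URL is whatever GHRSST granule you have access to, and your GDAL build's netCDF driver has to be able to read it over /vsicurl/ with Earthdata auth configured):

```python
# Template a GHRSST MUR granule URL as a vrt:// source that any GDAL-based
# tool can open, reading analysed_sst as Celsius from the native Int16.
def mur_source(granule_url: str) -> str:
    sub = f'NETCDF:"/vsicurl/{granule_url}":analysed_sst'
    return (
        f"vrt://{sub}?a_srs=EPSG:4326"
        "&a_ullr=-180,89.9905,180,-89.9905"
        "&a_scale=0.001&a_offset=25.0"
    )

# e.g. from osgeo import gdal; print(gdal.Info(mur_source(url)))
```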
Then any GDAL tool (since 3.8.0) can read/stream/convert from it as if it were a nice geospatial raster: R, Python, Julia, QGIS, C++, the CLI … In later GDAL we can simplify the subdataset (variable) choice syntax, but I've left it at what the current release supports.
This is a small step forward for us. I'm hoping to come back and ask the questions above in more detail, to find out what others would do when faced with this dataset and its grid-specification ambiguities, before we commit to an actual conversion.
The chunk sizes are also weird (2047,1023) and inconsistent: at "2023-04-22 UTC" the chunk sizes of analysed_sst are 7200,3600, and again between "2023-11-09" and "2024-06-09".
One change we need to report to the above: -180,180 is not the right extreme margins; the (GDAL) geotransform should be
-179.995 0.010 0.000 89.995 0.000 -0.010
giving an overall extent (at cell edges) of
-179.995 180.005 -89.995 89.995
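A quick arithmetic check that this transform is consistent with the 36000x17999 array (just restating the numbers above):

```python
# Corrected GDAL geotransform and the cell-edge extent it implies.
ncol, nrow = 36000, 17999
gt = (-179.995, 0.010, 0.0, 89.995, 0.0, -0.010)

xmin, xmax = gt[0], gt[0] + ncol * gt[1]  # -179.995, 180.005
ymax, ymin = gt[3], gt[3] + nrow * gt[5]  # 89.995, -89.995
print(xmin, xmax, ymin, ymax)
```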
The 32-bit float coords are surely trying to represent that detail, but fail because of the numeric imprecision. This actually means that GDAL went through an incorrect pathway when using the NetCDFs directly with its rectilinear geolocation array mechanism and broke some antimeridian logic, fixed now and documented here:
(but still, it seems better to virtualize the extent with vrt or fix the files).
I wanted to make a reprex of that from the source but am struggling to apply the Earthdata auth right now. This indexing capability is exciting! Huge congrats to everyone who made it happen.