Fixing GHRSST - seeking opinions

Michael_Sumner · November 10, 2023, 10:41pm

The GHRSST NetCDF product (Multi-scale_Ultra-high_Resolution_MUR-SST) has a number of problems:

units are Kelvin, this destroys lazy load strategies with the need to recalculate having to be pushed all the way to the user experience
the degenerate rectilinear storage of lon,lat arrays in NetCDF is better expressed as an extent (bbox, or if you like a 4-digit geotransform - offset,scale implicit bbox when combined with the dimensions ncol,nrow), and this is not cleanly derivable from the values in the netcdf
the coord arrays are 32-bit float, so they appear to be slightly irregular but this is obviously a mistake
they are pretty big arrays 36000x17999 Int16 (there are also Byte array masks and sea ice fraction) and there’s no overviews (precomputed pyramids for fast read of large areas at lower resolution)
the offset and scale of the data values Int16 converts them to Kelvin (edit: we might just convert to float, though)
there’s no real use of the 3rd dimension in the netcdf (time) it’s just a time stamp for the array

My solution is this:

write as COGs with overviews using average resampling
use the extent (the extreme edges of area-pixels) xmin,xmax,ymin,ymax -180,180,-89.9905,89.9905 (this appears to be the intention, one missing half-row at the north and south margins)
write the same values as Int16 with ZSTD or DEFLATE compression but override the offset scale metadata so they read out as Celsius

I’m a very bad Python coder but I’m getting ok with it, the important snippet for the extent and scaling fix is coded here:

github.com

mdsumner/ghrsst.coop/blob/3c1f3a6d94f2993239324b676f632499d370931e/do_ghrsst.py#L30C82-L30C107


      
          dsn = f"vrt://NetCDF:\"{filename}\":{sds}?a_srs=OGC:CRS84&a_ullr=-180,89.9905,180,-89.9905&a_scale=0.001&a_offset=25"

The code is just a draft, I’m still using it with local testing - it’s the grid spec logic I’m interested in discussing and not the deployment details of this.

The vrt:// protocol syntax there requires GDAL >= 3.7.

There’s no problem doing this in either R or Python but we’ll see how that goes - Python has a real GDAL API and R does not, even though it can be done with the partial pieces available - it seems sensible to frame this in Python terms here but the performance per file would be exactly the same with GDAL doing the actual work. I originally drafted this with the warper API to subsample to full global extent 36000x18000 but that seems problematic to justify, and I think this reframing of the extent is probably correct.

Alternative workflows with odc/xarray don’t seem as straightforward to me but that might just be my discomfort with the tools themselves.

This problem has plagued me and colleagues for years, and now it just seems trivial to create a copy and set up a workflow to maintain it (there’s 7000+ days since June 2002 and so it’s ~2.5Tb in total now) - not huge in today’s context but enough to prevent an individual doing it for themselves.

I’m interested in opinions or anything glaring I’m missing. I’ve searched a long time for any real discussion about this dataset and its issues and haven’t found anything. It’s extremely hard to use in its current form and COG set on cloud storage would be fantastic.

Do you think the top edge should be aligned to 90? Or otherwise different to my take?

Thanks!

dcherian · November 10, 2023, 10:56pm

What workflows are you trying to execute with this data in the end?

Michael_Sumner · November 10, 2023, 11:18pm

general access, point and trajectory extraction, array summaries, we use sst routinely for animal tracking, ecosystem modelling, and comparison with shipboard measures

I personally don’t do analyses much anymore … I’m just supporting science users and collaborators at our institution. our current tooling works fine but it would be faster and publically accessible for this dataset very easily in a cloud COG form.

Michael_Sumner · March 19, 2024, 8:10am

I have this to the point where any of the GHRSST URLs can be templated like this, to generate the COG that I think has the right interpretation, and it reads out as Celsius without having to convert or unscale the data from its native integer type.

dsn = f"vrt://NetCDF:\"/vsicurl/{input_path}\":analysed_sst?a_ullr=-180,89.995,180,-89.995&a_offset=25&a_scale=0.001&a_srs=EPSG:4326"

Where input_path looks like

https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

You need to have the GDAL_HTTP_HEADERS config set, which is the contents of the EARTHDATA_TOKEN in a form like this:

header = f"Authorization: Bearer {os.environ['EARTHDATA_TOKEN']}"

Then any GDAL tool (since 3.8.0) can read/stream/convert from it as if it were a nice geospatial raster, R, Python, Julia, QGIS, C++, cli … In later GDAL we can simplify the subdataset (variable) choice syntax, but I’ve left it at current release support.

This is a small step forward for us, I’m hoping to come back and ask the questions above in more detail, to find maybe what others would do when faced with this dataset, and its grid specification ambiguities, before we commit to an actual conversion.

Michael_Sumner · September 6, 2024, 7:50pm

the chunk sizes are also weird (2047,1023) and inconsistent, at “2023-04-22 UTC” the chunk sizes of analysed_sst are 7200, 3600 and again between “2023-11-09” to “2024-06-09”.

(pointed out in a VirtualiZarr meeting notes, I reported back to the providers, EDIT: already reported
https://forum.earthdata.nasa.gov/viewtopic.php?t=5909)

I think analysis_error is the same, mask and sea_ice_fraction are different again

using this code to assesss against our local copy

library(furrr)
library(ncdf4)

chunk <- function(x, var) {nc <- ncdf4::nc_open(x); out <- nc$var[[var]]$chunksizes; ncdf4::nc_close(nc); out}

plan(multicore)

## analysed_sst, analysis_error, mask, sea_ice_fraction
x <- do.call(rbind, future_map(files$fullname, chunk, "analysis_error"))  ## etc
plan(sequential)

We’ve got a lambda task running daily to produce COGs, it’s not public yet and we haven’t run the backlog but hoping to do both of those soon.

Michael_Sumner · September 8, 2024, 8:18am

One change we need to report above is that -180,180 is not the right extreme margins, the (GDAL) transform should be

-179.995    0.010    0.000   89.995    0.000   -0.010

giving an overall extent (at cell edges)

-179.995  180.005  -89.995   89.995

The 32-bit float coords is surely trying to represent that detail, but fails because of the numeric imprecision. This actually means that GDAL went through an incorrect pathway when using the NetCDFs directly with its rectilinear geolocation array mechanism and broke some antimeridian logic, fixed now and documented here:

(but still, it seems better to virtualize the extent with vrt or fix the files).

Michael_Sumner · March 23, 2025, 9:48pm

we’re confident about our take on the weird grid, it really does sit over the anti-meridian and has a single row missing (half each at top/bottom)

COGs published here, updated regularly, there’s a STAC index in Parquet as well: Source Cooperative - Source Cooperative

Topic		Replies	Views
Need guidance on manipulating NetCDF files Data	4	344	April 22, 2024
Example which highlights the limitations of NetCDF-style coordinates for large geospatial rasters	46	2246	July 2, 2025
Writing large Xarray datasets to NetCDF Science	7	2386	September 21, 2023
Netcdf to Zarr best practices Data	13	10287	February 10, 2021
Xarray and compression options for large NetCDF files Data	8	3775	March 8, 2022

Fixing GHRSST - seeking opinions

Related topics