Recommendation for hosting cloud-optimized data

Apologies for not updating with the new Zarr files to match the NetCDF chunking. I ran into pangeo-forge/pangeo-forge-recipes#227 (“Debugging memory issues in IMERG”) and haven’t had a chance to dig into it.

I’m glad to hear that others have already done similar benchmarks and have reached the conclusion that performance is roughly similar (which makes sense, but it’s good to see confirmed).

[Rich] Do we know how many concurrent requests it would take before we would see less than linear scaling?

https://docs.microsoft.com/en-us/azure/storage/blobs/scalability-targets has some numbers. We haven’t observed slowdowns, but we and our users have hit the limits of the Blob Storage service, and so have had to add retry logic to our applications.
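For concreteness, here’s a minimal sketch of the kind of retry logic I mean: exponential backoff around a plain requests call. The status codes and backoff parameters are illustrative, not a recommendation from the Blob Storage docs.

import time

import requests


def get_with_retries(url, max_attempts=5, backoff=1.0):
    """Fetch a URL, backing off when the storage service throttles us."""
    for attempt in range(max_attempts):
        response = requests.get(url)
        # 429 / 503 are the typical throttling responses; retry those and
        # return (or raise) everything else immediately.
        if response.status_code not in (429, 503):
            response.raise_for_status()
            return response
        time.sleep(backoff * 2 ** attempt)
    response.raise_for_status()
    return response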

[Ryan] the primary data store on the cloud should be the original netCDF files + a kerchunk index .

I like this recommendation, primarily because it means I’m not in the awkward position of claiming we host dataset X when it’s actually a somewhat different (but better!) cloud-optimized version.

[Ryan] How should a sequence of netCDF files + kerchunk index be represented in STAC? Does each individual netCDF file need to go into the STAC catalog? Or does the kerchunk index effectively serve as a catalog for the individual files?

Anything is possible, but one factor is whether you’re exposing both the
netCDF files and the Kerchunk index through the STAC API. If you’re making
both available (as I think you should), I’d recommend:

  1. Use STAC Items + assets to catalog the netCDF files.
  2. Use collection-level assets (or a separate STAC collection) for the
    Kerchunk index files.

Otherwise, you risk searches returning “duplicate” matches, one for the regular
netCDF files and one for the index file.
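Roughly, with pystac (assuming a version that supports collection-level assets), that split could look like the sketch below. The IDs, hrefs, extent, asset keys, and roles are all placeholders.

import datetime

import pystac

extent = pystac.Extent(
    spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
    temporal=pystac.TemporalExtent([[datetime.datetime(2000, 1, 1), None]]),
)
collection = pystac.Collection(
    id="example-netcdf",
    description="netCDF files plus a Kerchunk index",
    extent=extent,
)

# 1. One STAC Item, with a data asset, per netCDF file.
item = pystac.Item(
    id="example-2000-01-01",
    geometry=None,
    bbox=None,
    datetime=datetime.datetime(2000, 1, 1),
    properties={},
)
item.add_asset(
    "data",
    pystac.Asset(
        href="https://example.blob.core.windows.net/data/example-2000-01-01.nc",
        media_type="application/netcdf",
        roles=["data"],
    ),
)
collection.add_item(item)

# 2. A single collection-level asset for the Kerchunk index, so item-level
#    searches don't return it as a "duplicate" match.
collection.add_asset(
    "references",
    pystac.Asset(
        href="https://example.blob.core.windows.net/kerchunk/example.json",
        media_type="application/json",
        roles=["references"],
    ),
)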

Do we need a formal kerchunk STAC extension?

I’m not sure yet. I’ve found the STAC extensions helpful for letting tools like
intake-stac programmatically, and safely, go from STAC → Dataset.

Currently, loading these kerchunk-style datasets looks like:

import fsspec
import planetary_computer
import requests
import xarray as xr

# `collection` is the STAC collection holding the Kerchunk references
# (e.g. loaded with pystac / pystac-client).
asset = collection.assets["ACCESS-CM2.historical"]

# Fetch the Kerchunk reference JSON and build a reference filesystem on it.
references = requests.get(asset.href).json()
reference_filesystem = fsspec.filesystem("reference", fo=references)
# Sign the references so the underlying Azure Blob Storage URLs are readable.
reference_filesystem = planetary_computer.sign(reference_filesystem)

# Open the dataset using the open_dataset kwargs recorded in the STAC asset metadata.
ds = xr.open_dataset(
    reference_filesystem.get_mapper("/"),
    **asset.extra_fields["xarray:open_dataset_kwargs"],
)

The new thing here is the intermediate requests.get(...) to get the
references (I think we can’t quite just pass the URL, since the Azure Blob
Storage URLs need to be signed, but other groups might be able to skip that).
So the only additional thing needed might be a flag in the xarray metadata
indicating that it’s a kerchunk file / reference filesystem, so that tools
like intake-stac know what to do.
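As a sketch of what that flag could look like, with a made-up field name (this
is not an existing extension field), a tool could branch on it like this:

import fsspec
import requests
import xarray as xr

# `asset` is the STAC asset from the snippet above. The field name here is
# entirely hypothetical; it just illustrates the kind of flag meant.
is_reference = asset.extra_fields.get("kerchunk:is_reference_file", False)

if is_reference:
    # Reference-filesystem path, as above.
    references = requests.get(asset.href).json()
    fs = fsspec.filesystem("reference", fo=references)
    ds = xr.open_dataset(
        fs.get_mapper("/"), **asset.extra_fields["xarray:open_dataset_kwargs"]
    )
else:
    # Plain netCDF file: open the href directly.
    ds = xr.open_dataset(fsspec.open(asset.href).open())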

[Ryan] Can we use checksumming to verify the file integrity?

There are a few places where we might want to do that:

  1. From the original data provider (e.g. a THREDDS server) to Blob Storage. If
    the data provider provides checksums, we should validate against them.
  2. The kerchunk index: to verify that the data read via kerchunk / reference
    filesystem is identical to the data you would have gotten via the NetCDF file.

On this second item, what exactly would we be checksumming / comparing? That the
raw bytes of the NetCDF variable are identical to what we’d read via kerchunk?
Or is the range request made by Kerchunk just a subset of the variable’s data
(skipping a header or footer, for example)?
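Whatever we settle on for the byte-level comparison, one pragmatic end-to-end
check is to open the same file both ways and compare the decoded values. A
sketch, with placeholder URLs, assuming the Kerchunk index covers just this one
file:

import fsspec
import requests
import xarray as xr

nc_url = "https://example.blob.core.windows.net/data/example-2000-01-01.nc"
index_url = "https://example.blob.core.windows.net/kerchunk/example.json"

# Read through the Kerchunk reference filesystem...
references = requests.get(index_url).json()
fs = fsspec.filesystem("reference", fo=references)
ds_kerchunk = xr.open_dataset(fs.get_mapper("/"), engine="zarr", consolidated=False)

# ...and directly from the netCDF file.
ds_direct = xr.open_dataset(fsspec.open(nc_url).open(), engine="h5netcdf")

# The decoded values should match, even if the raw byte ranges differ.
xr.testing.assert_equal(ds_kerchunk.load(), ds_direct.load())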

As for storing those checksum values, I think either the STAC metadata or the
Kerchunk index file itself would be fine.
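
If they end up in the STAC metadata, the file extension’s file:checksum field
(a multihash) seems like a natural place. A sketch, with a placeholder local
path:

import hashlib

# Placeholder: a local copy of the netCDF file whose checksum we're recording.
with open("example-2000-01-01.nc", "rb") as f:
    digest = hashlib.sha256(f.read()).digest()

# file:checksum is a multihash; 0x12 0x20 is the prefix for a 32-byte
# sha2-256 digest.
asset.extra_fields["file:checksum"] = (bytes([0x12, 0x20]) + digest).hex()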