Recommendation for hosting cloud-optimized data

I’m trying to come up with some clear guidance for how to provide and access labeled n-dimensional data from Blob Storage. I’ll post a few thoughts below, but would appreciate some feedback from the community on what they think is the best path forward.

For context, we (the Planetary Computer team, but others too I’m sure) are often provided data files as NetCDF / HDF5 by our partners. Accessing NetCDF / HDF5 files from blob storage isn’t great, as described in Matt’s post: HDF in the Cloud.
So at the moment, we’re choosing between converting all of that data to Zarr or providing “kerchunk / reference files” (is there a succinct name to give this?) to enable high-performance access in the cloud. I think most things we say here about Zarr apply equally well to formats like TileDB.

To help make an informed choice, I’ve taken a collection of NetCDF files from NASA-NEX-GDDP-CMIP6 and created

  1. A Zarr store
  2. A Kerchunk / reference filesystem

I’ll clean up the scripts used to do the conversion and will post those here. I’ve belatedly realized that I didn’t take care to ensure the compression options of the Zarr store matched the original data, which might throw off some of the timings.
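Roughly, the kerchunk half looks like this (the account name, container layout, and glob pattern are placeholders, and the MultiZarrToZarr call follows the current kerchunk API, which may differ a bit from the version I actually ran):

import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("abfs", account_name="nasagddp")            # placeholder account
urls = fs.glob("gddp-cmip6/ACCESS-CM2/historical/*/pr_*.nc")       # placeholder layout

# One reference set per NetCDF file: scan the HDF5 layout and record the
# byte offset / length of every internal chunk.
singles = []
for u in urls:
    with fs.open(u) as f:
        singles.append(SingleHdf5ToZarr(f, f"abfs://{u}").translate())

# Combine the per-file references into one virtual dataset along "time".
combined = MultiZarrToZarr(
    singles,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
    remote_protocol="abfs",
    remote_options={"account_name": "nasagddp"},
).translate()

with open("ACCESS-CM2.historical.json", "w") as f:
    json.dump(combined, f)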

EDIT: I messed up the chunking in the Zarr files, so they don’t match the (internal) chunking of the NetCDF files. So these benchmarks are not meaningful.

The raw data

Each original NetCDF file contains a single float32 data variable with dimensions (time, lat, lon) and shape (365, 600, 1440).

<xarray.Dataset>
Dimensions:  (time: 365, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01T12:00:00 ... 1950-12-31T12:00:00
  * lat      (lat) float64 -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Data variables:
    pr       (time, lat, lon) float32 ...
Attributes: (12/22)
    activity:              NEX-GDDP-CMIP6
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    Conventions:           CF-1.7
    creation_date:         2021-10-04T13:59:54.607947+00:00
    frequency:             day
    institution:           NASA Earth Exchange, NASA Ames Research Center, Mo...
    ...                    ...
    history:               2021-10-04T13:59:54.607947+00:00: install global a...
    disclaimer:            This data is considered provisional and subject to...
    external_variables:    areacella
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0

There are typically 9 data variables per year (each in its own NetCDF file, but sharing the same dimensions). This dataset runs from 1950 - 2014 (inclusive), for a total of 585 NetCDF files. We want to combine all of those into one logical dataset.

Both Zarr and the kerchunk / reference filesystem achieve our original goal: we can quickly read the metadata for these collections in ~1 second (as a rough estimate, it would take ~10 minutes to read the metadata for all the NetCDF files). All the data lives in Blob Storage, and the data and compute are both in Azure’s West Europe region.
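For reference, opening each of them looks roughly like this (the account, container, and file names are illustrative, not the actual Planetary Computer paths):

import fsspec
import xarray as xr

# Zarr: a single consolidated-metadata read, then lazy access to chunk objects.
store = fsspec.get_mapper("abfs://cmip6/ACCESS-CM2.historical.zarr",
                          account_name="nasagddp")
ds_zarr = xr.open_zarr(store, consolidated=True)

# Kerchunk: the JSON index plays the role of the Zarr metadata, and each chunk
# read becomes a byte-range request into the original NetCDF files.
ref_fs = fsspec.filesystem(
    "reference",
    fo="https://nasagddp.blob.core.windows.net/references/ACCESS-CM2.historical.json",
    remote_protocol="abfs",
    remote_options={"account_name": "nasagddp"},
)
ds_refs = xr.open_dataset(ref_fs.get_mapper("/"), engine="zarr", consolidated=False)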

To get a sense for performance, I compared some indexing operations (selecting a point, a chunk, a timeseries, …) for a single variable and for all 9 variables (full code in the notebook). Here are the timings in tabular form:

| source | operation | time (s) |
| --- | --- | --- |
| zarr | all variables-point | 19.9646 |
| references | all variables-point | 0.189757 |
| zarr | all variables-chunk (partial) | 20.2704 |
| references | all variables-chunk (partial) | 11.8032 |
| zarr | all variables-chunk (full) | 20.0788 |
| references | all variables-chunk (full) | 51.4636 |
| zarr | single variable-point | 6.8371 |
| references | single variable-point | 0.0267782 |
| zarr | single variable-chunk (partial) | 6.90783 |
| references | single variable-chunk (partial) | 1.37082 |
| zarr | single variable-chunk (full) | 6.92691 |
| references | single variable-chunk (full) | 5.73868 |
| zarr | single variable-timeseries (point) | 137.09 |
| references | single variable-timeseries (point) | 308.311 |

and graphically: [bar chart of the same timings, not reproduced here]


Some thoughts:

  1. I really appreciate the goal of kerchunk / reference filesystem. The data providers don’t have to upgrade their systems to produce a new file format. We, the hosts, don’t have to go through a sometimes challenging process to do the conversion to Zarr (though that’s getting easier with pangeo-forge). But we still get cloud-friendly access.
  2. I worry a bit about pushing a new spec / system for accessing data. But that said, the original NetCDF files are there as a fallback.
  3. The libraries for generating and reading these reference files are young, missing features, and sometimes buggy. But all of that can be solved with a bit of work. It seems like the fundamental idea of capturing these offsets / lengths and using range requests is sound.

If you’re interested in playing along, the code to reproduce this is at zarr-kerchunk-comparison.ipynb | notebooksharing.space (thanks to @yuvipanda for reminding me about notebook-sharing-space :slight_smile: ). All of the data are publicly / anonymously accessible, you just might need to install the “planetary-computer” package to get a short-lived token to access the data.


Tom this is super interesting, thanks for posting!

I’m concerned about one thing in your comparison: you’re using different chunk sizes for the two datasets. Specifically, the kerchunk chunks are (1, 600, 1440) while the Zarr chunks are (365, 600, 1440). I think that completely explains why Zarr is slower in some cases (e.g. point or partial access).

Wouldn’t a more useful comparison be to use the same chunk sizes in the Zarr as in the Kerchunk dataset? Then we could more directly evaluate the differences between the two approaches.
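Concretely, something like this (file pattern and store path are placeholders) would give the Zarr store the same (1, 600, 1440) chunking:

import xarray as xr

# xarray uses the dask chunks as the Zarr chunks, so rechunking to
# (1, 600, 1440) reproduces the layout kerchunk exposes.
ds = xr.open_mfdataset("pr_day_ACCESS-CM2_historical_*.nc", combine="by_coords")
ds.chunk({"time": 1, "lat": 600, "lon": 1440}).to_zarr(
    "ACCESS-CM2.historical.zarr", consolidated=True, mode="w"
)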

Whoops, let me check on that. Agreed they need to be the same for us to make meaningful conclusions. I think I misunderstood how kerchunk works, and incorrectly assumed that the chunking would match the source.

My understanding is that kerchunk will match the source chunks. HDF5 files support internal chunking, and kerchunk will try to map chunks 1:1.

Yep that’s correct, Kerchunk maps the internal chunks of individual HDF5 files 1:1. Individual files being aggregated along a dimension are treated as 1 chunk per file, which is why you’re seeing Kerchunk chunks (say that five times fast) are (1, 600, 1440).

We are generally finding that access times for a zarr version of a dataset and kerchunked from the original are rather similar, with differences mainly in the open/metadata phase. Caveat: copying to zarr lets you make a new decision on the encoding/compression, so you can do worse or better.

One note: you can, of course, set the dask chunk size to be a multiple of the real physical chunk size. For cases where you want throughput, this means that within each task the chunks are loaded concurrently, and you save on latency, which for <10MB chunks can be substantial. This is true for both zarr and kerchunked data. You will do slightly worse for single-point access, since you load more bytes than you need.
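For example (placeholder store path), opening with dask chunks that span 30 of the physical one-day chunks looks like:

import fsspec
import xarray as xr

store = fsspec.get_mapper("abfs://cmip6/ACCESS-CM2.historical.zarr",
                          account_name="nasagddp")  # placeholder path

# Each dask task now covers 30 of the (1, 600, 1440) physical chunks; zarr
# fetches them concurrently within the task, amortising per-request latency.
ds = xr.open_zarr(store, chunks={"time": 30})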

When we compared the Zarr library reading Zarr format vs NetCDF4 format (via ReferenceFileSystem) on AWS, we found the performance was the same, using the same compression and chunking.

Not too surprising, since both techniques read the metadata all in one shot, then read chunks of binary data from object storage.

With a Zarr dataset, each chunk is stored as a separate object, while with ReferenceFileSystem, each scientific data format file is stored as a separate object.

The only difference is reading the chunks as separate objects vs making concurrent byte-range requests to read chunks from within larger objects.

Unless the object storage can’t support the requested number of concurrent byte range requests, the results should be the same.

It’s nice to know that Azure Blob Storage can apparently handle a large number of concurrent byte-range requests (AWS S3 does too!). Do we know how many concurrent requests it would take before we would see less-than-linear scaling?


That’s good to know, Rich. I definitely would have expected the access time to basically be the same for equal chunks and compression.

Assuming that’s the case, what recommendations can we make about the best architecture for cloud data storage with these data?

I’m leaning strongly toward the following recommendation: the primary data store on the cloud should be the original netCDF files + a kerchunk index. From this, it should be really easy to create copies / views / transformations of the data that are optimized for other use cases. For example, you could easily use rechunker to transform [a subset of] the data to support a different access pattern (e.g. timeseries analysis). Or you could put xpublish on top of the dataset to provide on-the-fly computation of derived quantities.
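As a sketch of the rechunker idea (paths and target chunk sizes here are made up; the real choice depends on the access pattern you’re optimizing for):

import xarray as xr
from rechunker import rechunk

# Placeholder paths; target chunks favour long time series at single points.
ds = xr.open_zarr("ACCESS-CM2.historical.zarr")
plan = rechunk(
    ds,
    target_chunks={
        "pr": {"time": 23725, "lat": 50, "lon": 50},  # ~65 years x 365 days
        "time": None,  # leave coordinate arrays unchunked
        "lat": None,
        "lon": None,
    },
    max_mem="2GB",
    target_store="ACCESS-CM2.historical.timeseries.zarr",
    temp_store="rechunk-tmp.zarr",
)
plan.execute()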

Some of the questions in my mind are:

  1. How should a sequence of netCDF files + kerchunk index be represented in STAC? Does each individual netCDF file need to go into the STAC catalog? Or does the kerchunk index effectively serve as a catalog for the individual files?
  2. Do we need a formal kerchunk STAC extension?
  3. Can we use checksumming to verify the file integrity?

I believe STAC can already include “storage_options” in a definition, which would mean that a kerchunked dataset would already work, but only in Python (because this is fsspec-specific). If you wanted to generalise, you would need something formal. The kerchunk idea has been demonstrated with a JS zarr frontend by @manzt, which required some custom coding.

Yes, the kerchunk references index over all the files, so the data spec in STAC would only need to point to the references JSON.

An HDF5-style crc32/fletcher32 checksum is very doable for kerchunk - but maybe unnecessary if the original files have already been verified and we trust our content-addressing scheme. This PR “copes” with the checksum by ignoring it; we could have made a numcodecs filter to optionally enforce the check easily enough.

cc @TomAugspurger I seem to recall you demonstrated storage_options in STAC

Apologies for not updating with the new Zarr files to match the NetCDF chunking. I ran into Debugging memory issues in IMERG · Issue #227 · pangeo-forge/pangeo-forge-recipes · GitHub and haven’t had a chance to dig into it.

I’m glad to hear that others have already done similar benchmarks and have reached the conclusion that performance is roughly similar (which makes sense, but it’s good to see confirmed).

[Rich] Do we know how many concurrent requests it would take before we would see less than linear scaling?

https://docs.microsoft.com/en-us/azure/storage/blobs/scalability-targets has some numbers. We haven’t observed slowdowns, but we / our users have hit the limits of the blob storage service, and so have had to add retry logic to their applications.

[Ryan] the primary data store on the cloud should be the original netCDF files + a kerchunk index.

I like this recommendation, primarily because it means I’m not in the awkward position of claiming we host dataset X, when it’s actually a somewhat different (but better!) cloud-optimized version.

[Ryan] How should a sequence of netCDF files + kerchunk index be represented in STAC? Does each individual netCDF file need to go into the STAC catalog? Or does the kerchunk index effectively serve as a catalog for the individual files?

Anything is possible, but one factor is whether you’re exposing both the netCDF files and the Kerchunk index through the STAC API. If you’re making both available (as I think you should), I’d recommend:

  1. Use STAC Items + assets to catalog the netCDF files.
  2. Use collection-level assets (or a separate STAC collection) for the Kerchunk index files.

Otherwise, you risk searches returning “duplicate” matches, one for the regular
netCDF files and one for the index file.
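As a sketch of what a collection-level Kerchunk asset could look like with pystac (`collection` is an existing pystac.Collection, and the href, roles, and xarray kwargs are illustrative rather than a finalized convention):

import pystac

# Hypothetical collection-level asset pointing at the combined kerchunk index;
# the extra fields mirror what the loading snippet further down expects.
collection.add_asset(
    "ACCESS-CM2.historical",
    pystac.Asset(
        href="https://example.blob.core.windows.net/references/ACCESS-CM2.historical.json",
        media_type="application/json",
        roles=["references"],
        extra_fields={
            "xarray:open_dataset_kwargs": {
                "engine": "zarr",
                "consolidated": False,
                "chunks": {},
            }
        },
    ),
)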

[Ryan] Do we need a formal kerchunk STAC extension?

I’m not sure yet. I’ve found the STAC extensions helpful for enabling tools like
intake-stac to programmatically, and safely, go from STAC → Dataset.

Currently, loading these kerchunk-style datasets looks like:

import fsspec
import planetary_computer
import requests
import xarray as xr

# Collection-level asset pointing at the kerchunk references JSON.
asset = collection.assets["ACCESS-CM2.historical"]

# Fetch the references and wrap them in an fsspec reference filesystem.
references = requests.get(asset.href).json()
reference_filesystem = fsspec.filesystem("reference", fo=references)

# Rewrite the templated Blob Storage URLs with signed (SAS token) versions.
reference_filesystem = planetary_computer.sign(reference_filesystem)

ds = xr.open_dataset(
    reference_filesystem.get_mapper("/"),
    **asset.extra_fields["xarray:open_dataset_kwargs"],
)

The new thing here is the intermediate requests.get(...) to get the
references (I think we can’t quite just pass the URL, since the Azure Blob
Storage URLs need to be signed, but other groups might be able to skip that).
So the only additional thing needed might be a flag in the xarray metadata
indicating that it’s a kerchunk file / reference filesystem, so that tools
like intake-stac know what to do.

[Ryan] Can we use checksumming to verify the file integrity?

There’s a few places we might want to do that:

  1. From the original data provider (e.g. a THREDDS server) to Blob Storage. If
    the data provider provides checksums, we should validate against them.
  2. The kerchunk index: to verify that the data read via kerchunk / reference
    filesystem is identical to the data you would have gotten via the NetCDF file.

On this second item, what exactly would we be checksumming / comparing? That the
raw bytes of the NetCDF variable are identical to what we’d read via kerchunk?
Or is the range request made by Kerchunk just a subset of the variable’s data
(skipping a header or footer, for example)?

As for storing those checksum values, I think either the STAC metadata, or
perhaps within the Kerchunk index file itself would be fine.

We are hosting NetCDF files (from USGS, NASA, etc.) on our new Oracle Open Data platform, which is in limited-availability release right now (https://opendata.oraclecloud.com).

I work very closely with the product manager, and as our goal is to make this data accessible and simple to use, I would appreciate some guidance and assistance on the topic. For reference, the data is stored in buckets that are functionally the same as S3.

Would the community find value if we pre-converted to Zarr or kerchunk?

@TomAugspurger , what does the planetary_computer.sign(reference_filesystem) line do?

[Tom] That the raw bytes of the NetCDF variable are identical to what we’d read via kerchunk?

That would mean repeating the same RANGE request, probably not too useful :slight_smile:

[Tom] Or is the range request made by Kerchunk just a subset of the variable’s data (skipping a header or footer, for example)?

It is identical to the chunk’s data; header info (compression, attributes) has already been extracted. I would suggest that we could implement the same as what HDF does, i.e., a crc/fletcher32 checksum implemented as a numcodecs filter. I don’t know if there are any other checksums beyond whole-file ETags.
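Roughly what such a filter could look like (a sketch using CRC32 via zlib rather than HDF5’s actual fletcher32 algorithm; the codec name is made up):

import struct
import zlib

from numcodecs import register_codec
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes, ndarray_copy


class ChecksumCheck(Codec):
    """Illustrative filter: append a CRC32 on encode, verify and strip it on decode."""

    codec_id = "checksum_check"

    def encode(self, buf):
        buf = ensure_bytes(buf)
        return buf + struct.pack("<I", zlib.crc32(buf) & 0xFFFFFFFF)

    def decode(self, buf, out=None):
        buf = ensure_bytes(buf)
        data, stored = buf[:-4], struct.unpack("<I", buf[-4:])[0]
        if zlib.crc32(data) & 0xFFFFFFFF != stored:
            raise RuntimeError("chunk failed checksum verification")
        return ndarray_copy(data, out) if out is not None else data


register_codec(ChecksumCheck)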

@RaoulM :

Would the community find value if we pre-converted to Zarr or kerchunk?

There’s a lot of discussion on when and why in pangeo-forge, which I won’t repeat here. In brief: converting to zarr gets you the goodness of cloud-friendly access, and lets you choose the compression and chunking, but duplicates the data, with the associated cost. Kerchunk is designed to get you cloud-optimised access to the original data (no duplication, just a small index file), but you are limited to the original chunking/compression. Kerchunking will not be possible for all data sets.

btw, @TomAugspurger , would be interested to know if Boost references speed by martindurant · Pull Request #892 · fsspec/filesystem_spec · GitHub makes any difference to your timing.

It overwrites the URLs in templates with the signed versions. https://github.com/microsoft/planetary-computer-sdk-for-python/blob/27037bcb9a89da9aa498ed5b7fd03eaa8b5c61a0/planetary_computer/sas.py#L284. Not the original use case for templates, I’m sure, but it was handy :slight_smile:

Thanks very much. Since we get the storage at much reduced cost, Zarr may be the way to go. I will review the discussions in Pangeo-forge and work with the product manager.