I’m trying to come up with some clear guidance for how to provide and access labeled n-dimensional data from Blob Storage. I’ll post a few thoughts below, but would appreciate some feedback from the community on what they think is the best path forward.
For context, we (the Planetary Computer team, but others too I’m sure) are often provided data files as NetCDF / HDF5 by our partners. Accessing NetCDF / HDF5 files from blob storage isn’t great, as described in Matt’s post: HDF in the Cloud.
So at the moment, we’re choosing between converting all of that data to Zarr or providing “kerchunk / reference files” (is there a succinct name to give this?) to enable high-performance access in the cloud. I think most things we say here about Zarr apply equally well to formats like TileDB.
To help make an informed choice, I’ve taken a collection of NetCDF files from NASA-NEX-GDDP-CMIP6 and created
- A Zarr store
- A Kerchunk / reference filesystem
I’ll clean up the scripts used to do the conversion and will post those here. I’ve belatedly realized that I didn’t take care to ensure the compression options of the Zarr store matched the original data, which might throw off some of the timings.
EDIT: I messed up the chunking in the Zarr files, so they don’t match the (internal) chunking of the NetCDF files. So these benchmarks are not meaningful.
The raw data
Each original NetCDF file contains a single float32 data variable with dimensions (time, lat, lon) and shape (365, 600, 1440).
```
<xarray.Dataset>
Dimensions:  (time: 365, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01T12:00:00 ... 1950-12-31T12:00:00
  * lat      (lat) float64 -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Data variables:
    pr       (time, lat, lon) float32 ...
Attributes: (12/22)
    activity:              NEX-GDDP-CMIP6
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    Conventions:           CF-1.7
    creation_date:         2021-10-04T13:59:54.607947+00:00
    frequency:             day
    institution:           NASA Earth Exchange, NASA Ames Research Center, Mo...
    ...                    ...
    history:               2021-10-04T13:59:54.607947+00:00: install global a...
    disclaimer:            This data is considered provisional and subject to...
    external_variables:    areacella
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
```
There are typically 9 data variables per year (each in its own NetCDF file, but sharing the same dimensions). The dataset runs from 1950 to 2014 (inclusive), for a total of 585 NetCDF files. We want to combine all of those into one logical dataset.
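For a rough sense of scale, here’s the arithmetic on the combined dataset (my calculation from the counts above, not from the original benchmarks; it assumes every file has a 365-day year, which real CMIP6 calendars may not):

```python
# Back-of-the-envelope size of the combined dataset.
# Assumes every file holds one float32 variable of shape (365, 600, 1440);
# actual CMIP6 calendars can have 360 or 366 days, so this is approximate.
n_files = 585             # 65 years (1950-2014) x 9 variables
shape = (365, 600, 1440)  # (time, lat, lon)
itemsize = 4              # float32 = 4 bytes

bytes_per_file = shape[0] * shape[1] * shape[2] * itemsize
total_bytes = n_files * bytes_per_file

print(f"per file: {bytes_per_file / 1e9:.2f} GB")  # ~1.26 GB uncompressed
print(f"total:    {total_bytes / 1e12:.2f} TB")    # ~0.74 TB uncompressed
```

So this is a "many small-ish files" problem rather than a raw-volume problem, which is exactly where metadata overhead dominates.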
Both Zarr and the kerchunk / reference filesystem achieve our original goal: we can quickly read the metadata for these collections in ~1 second (as a rough estimate, it would take ~10 minutes to read the metadata for all the NetCDF files). All the data lives in Blob Storage, and the data and compute are both in Azure’s West Europe region.
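That ~10 minute figure is consistent with paying one metadata read per file: the per-file cost below is my assumption (the post doesn’t state one), but at about a second of HDF5 header reads plus HTTP round trips per file, 585 files lands right around ten minutes.

```python
# Rough model of the metadata-read cost for the un-kerchunked NetCDF files.
# per_file_seconds is an assumed value (HDF5 header reads + HTTP latency),
# not something measured in the original benchmarks.
n_files = 585
per_file_seconds = 1.0

total_minutes = n_files * per_file_seconds / 60
print(f"~{total_minutes:.1f} minutes to open every file's metadata serially")
```

Both Zarr (via consolidated metadata) and the reference filesystem collapse all of that into a single read.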
To get a sense for performance, I compared some indexing operations (selecting a point, a chunk, a timeseries, …) for a single variable and for all 9 variables (full code in the notebook). Here are the timings in tabular form:
| source | operation | time |
|---|---|---|
| zarr | all variables-point | 19.9646 |
| references | all variables-point | 0.189757 |
| zarr | all variables-chunk (partial) | 20.2704 |
| references | all variables-chunk (partial) | 11.8032 |
| zarr | all variables-chunk (full) | 20.0788 |
| references | all variables-chunk (full) | 51.4636 |
| zarr | single variable-point | 6.8371 |
| references | single variable-point | 0.0267782 |
| zarr | single variable-chunk (partial) | 6.90783 |
| references | single variable-chunk (partial) | 1.37082 |
| zarr | single variable-chunk (full) | 6.92691 |
| references | single variable-chunk (full) | 5.73868 |
| zarr | single variable-timeseries (point) | 137.09 |
| references | single variable-timeseries (point) | 308.311 |
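To make the table easier to compare, here’s a quick calculation of the references-to-Zarr ratio for each operation, using the numbers above (a ratio below 1 means the reference filesystem was faster):

```python
# (zarr_time, references_time) pairs copied from the table above.
timings = {
    "all variables-point": (19.9646, 0.189757),
    "all variables-chunk (partial)": (20.2704, 11.8032),
    "all variables-chunk (full)": (20.0788, 51.4636),
    "single variable-point": (6.8371, 0.0267782),
    "single variable-chunk (partial)": (6.90783, 1.37082),
    "single variable-chunk (full)": (6.92691, 5.73868),
    "single variable-timeseries (point)": (137.09, 308.311),
}

ratios = {op: ref / zarr for op, (zarr, ref) in timings.items()}
for op, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{op:36s} references/zarr = {ratio:.3f}")
```

The pattern is what you’d expect from the lack of re-chunking: reads touching a small part of one file strongly favor the references, while reads of full chunks or long timeseries favor Zarr (keeping in mind the chunking caveat in the EDIT above, so don’t read too much into the absolute numbers).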
and graphically (chart omitted here; see the linked notebook).
Some thoughts:
- I really appreciate the goal of kerchunk / reference filesystem. The data providers don’t have to upgrade their systems to produce a new file format. We, the hosts, don’t have to go through a sometimes challenging process to do the conversion to Zarr (though that’s getting easier with pangeo-forge). But we still get cloud-friendly access.
- I worry a bit about pushing a new spec / system for accessing data. But that said, the original NetCDF files are there as a fallback.
- The libraries for generating and reading these reference files are young, missing features, and sometimes buggy. But all of that can be solved with a bit of work. It seems like the fundamental idea of capturing these offsets / lengths and using range requests is sound.
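To make that last point concrete: a reference file is essentially a JSON mapping from Zarr chunk keys to (url, offset, length) triples, so a reader can satisfy each chunk with a single HTTP range request against the original NetCDF file. The sketch below follows the kerchunk / ReferenceFileSystem version-1 layout, but the URL, chunk shape, offsets, and lengths are invented for illustration:

```python
import json

# A minimal, hand-written reference file (kerchunk "version 1" layout).
# The URL, chunk shape, offsets, and lengths are made up for illustration.
refs = {
    "version": 1,
    "refs": {
        # Zarr metadata can be inlined directly as strings...
        "pr/.zarray": json.dumps({
            "shape": [365, 600, 1440],
            "chunks": [1, 600, 1440],
            "dtype": "<f4",
            "compressor": None,
            "fill_value": None,
            "filters": None,
            "order": "C",
            "zarr_format": 2,
        }),
        # ...while each chunk key points at a byte range in the original
        # NetCDF/HDF5 file: [url, offset, length].
        "pr/0.0.0": ["https://example.com/pr_1950.nc", 8192, 3456000],
        "pr/1.0.0": ["https://example.com/pr_1950.nc", 3464192, 3456000],
    },
}

# A reader turns a chunk key into one HTTP range request.
url, offset, length = refs["refs"]["pr/0.0.0"]
print(f"GET {url}  Range: bytes={offset}-{offset + length - 1}")
```

Nothing here depends on the storage backend understanding HDF5; any store that supports range requests works, which is why the idea feels sound.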
If you’re interested in playing along, the code to reproduce this is at zarr-kerchunk-comparison.ipynb | notebooksharing.space (thanks to @yuvipanda for reminding me about notebook-sharing-space). All of the data are publicly / anonymously accessible; you just might need to install the “planetary-computer” package to get a short-lived token to access the data.