Fundamentals: What is Cloud-Optimized Scientific Data?

TomNicholas · April 17, 2025, 8:43pm

I wrote the article about the cloud that I wish I could have read back when I first heard of Zarr and cloud-native science in 2018.

I hope that people here find this useful, as a reference or as clarification of certain subtleties.

It also includes a (slightly contrived) benchmark of reading HDF from object storage as if it were stored as a local file vs reading Zarr and Icechunk.

I feel that this benchmark is a more streamlined demonstration (it disables caching entirely) of the challenge explored by @betolink and others, e.g. in his Pangeo Showcase:

–

This post is actually the second in Earthmover’s “Fundamentals” series - the first was @rabernat’s post last week on “Tensors vs Tables”:
