Pangeo Showcase: "Cloud Native Data Loaders for Machine Learning Using Zarr and Xarray"

Title: “Cloud Native Data Loaders for Machine Learning Using Zarr and Xarray”
Invited Speaker: Joe Hamman (ORCID:0000-0001-7479-8439)
When: Wednesday March 20, 12PM EDT
Where: Launch Meeting - Zoom
Abstract: We lack well-established patterns for streaming scientific data from cloud object storage into machine learning frameworks. This presentation will review a recent blog post we wrote (Cloud native data loaders for machine learning using Zarr and Xarray | Earthmover) describing one such pattern, which uses Xarray, Zarr, Xbatcher, and PyTorch to build a cloud-native dataloader for scientific data. I will explain how the dataloader works, outline the benchmark results, and discuss where we could go from here.

  • 20 minutes - Community Showcase
  • 40 minutes - Showcase Discussion/Community Check-ins
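
For anyone who wants the gist ahead of the talk, here is a minimal sketch of the pattern the abstract describes: open a Zarr store lazily with Xarray, slice it into patches with Xbatcher, and feed those to PyTorch. The store URL, variable name, and patch sizes below are hypothetical placeholders, not the actual benchmark configuration:

```python
import torch
import xarray as xr
import xbatcher

# Open an analysis-ready Zarr store lazily from object storage
# (hypothetical path and variable name).
ds = xr.open_zarr("s3://bucket/era5.zarr")

# Slice the dataset into fixed-size patches along time and space.
bgen = xbatcher.BatchGenerator(
    ds[["air_temperature"]],
    input_dims={"time": 32, "latitude": 64, "longitude": 64},
)

class ZarrDataset(torch.utils.data.Dataset):
    """Wrap an xbatcher generator as a map-style PyTorch dataset."""

    def __init__(self, bgen):
        self.bgen = bgen

    def __len__(self):
        return len(self.bgen)

    def __getitem__(self, idx):
        patch = self.bgen[idx]
        # .load() is where the actual reads from object storage happen.
        return torch.tensor(patch["air_temperature"].load().values)

loader = torch.utils.data.DataLoader(ZarrDataset(bgen), batch_size=4)
```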

There is also a blog post version of this talk here:

@rsignell - any guesses what COG performance would be like here? Or a kerchunked version?

Maybe ask Scott Henderson?

We benchmarked dataloading COGs from object storage that come in various sizes, from the KB range to tens of MB. Without some kind of sharding, IO is a big bottleneck for KB-range COGs; I think this was a network-request bottleneck. With sharding, by pre-batching COGs and storing the bytes in Parquet, IO performance doubled for the same dataset.
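
Roughly, the pre-batching step looked like the sketch below: pack the raw bytes of many small COGs into one Parquet file so a single GET fetches a whole shard. The paths and the one-binary-column schema are made up for illustration, not our exact pipeline:

```python
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

# Gather a batch of small COGs (hypothetical local staging directory).
cog_paths = sorted(pathlib.Path("cogs/").glob("*.tif"))

# Store each file's raw GeoTIFF bytes as one row of a binary column.
table = pa.table({
    "name": [p.name for p in cog_paths],
    "data": [p.read_bytes() for p in cog_paths],
})
pq.write_table(table, "cog_shard.parquet")

# At training time, one request retrieves the whole shard; individual
# COGs can then be decoded in memory, e.g. with rasterio's MemoryFile.
```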

I’m not sure whether COG versions of ERA5 exist, but the general principle is that it is harder to internally “rechunk” COGs, or to externally “rechunk” them by splitting them into smaller files, than it is with Zarr. So it is harder to tune dataloading performance starting from a COG dataset than from a Zarr store.
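
To make that comparison concrete: rewriting a Zarr store with different chunks is a short lazy pipeline (a sketch with hypothetical paths and chunk sizes; for very large stores you would reach for a tool like rechunker instead), whereas the COG equivalent means re-tiling and rewriting every file.

```python
import xarray as xr

ds = xr.open_zarr("s3://bucket/era5.zarr")  # hypothetical store

# Drop the on-disk chunk encoding so to_zarr doesn't try to reuse it.
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)

# Rechunk lazily and write a new store tuned for patch-wise dataloading.
(
    ds.chunk({"time": 1, "latitude": 256, "longitude": 256})
      .to_zarr("s3://bucket/era5-rechunked.zarr", mode="w")
)
```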

However, for ML dataloaders it’s really useful to maintain the projection CRS for each COG so that results can be georeferenced. I don’t see a way to maintain multiple CRSs in a single Zarr store while taking advantage of sharding and chunking. I hope to get agreement from the community that this is something rioxarray + the GeoZarr spec could handle. This comment is very relevant: Flexible coordinate transform by benbovy · Pull Request #9543 · pydata/xarray · GitHub
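
For context on where things stand today: rioxarray will carry a CRS through to Zarr, but as a single CF grid_mapping (spatial_ref) coordinate, i.e. one CRS per dataset, which is exactly the limitation above. A minimal sketch with hypothetical file and variable names:

```python
import rioxarray  # noqa: F401  (registers the "rasterio" engine and .rio accessor)
import xarray as xr

# Open one COG tile and attach (or overwrite) its CRS as a
# `spatial_ref` coordinate.
da = xr.open_dataarray("tile.tif", engine="rasterio")
da = da.rio.write_crs("EPSG:32633")

# The CRS survives the round trip to Zarr, but only one per dataset.
da.to_dataset(name="reflectance").to_zarr("tile.zarr", mode="w")
```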

Ryan, in your scenario, is each image the same size in terms of pixels, such that they can all stack into a regular array?


Why are these COGs so tiny? They surely don’t have internal chunking or overviews at that size? Is it a lot of missing data that compresses away?

Very interested to see your workflows (I know I’m a broken record here … but still). I see a lot of workflows that are way too far downstream in Python (or R) packages, when there are often much tighter options if you work closer to the API, in whatever language.