Pangeo Showcase: "Cloud Native Data Loaders for Machine Learning Using Zarr and Xarray"

Title: “Cloud Native Data Loaders for Machine Learning Using Zarr and Xarray”
Invited Speaker: Joe Hamman (ORCID:0000-0001-7479-8439)
When: Wednesday March 20, 12PM EDT
Where: Launch Meeting - Zoom
Abstract: We lack well-established patterns for streaming scientific data from cloud object storage into machine learning frameworks. This presentation will review a recent blog post we wrote (Cloud native data loaders for machine learning using Zarr and Xarray | Earthmover) describing one such pattern, which uses Xarray, Zarr, Xbatcher, and PyTorch to build a cloud-native dataloader for scientific data. I will explain how the dataloader works, outline the benchmark results, and discuss where we could go from here.

  • 20 minutes - Community Showcase
  • 40 minutes - Showcase Discussion/Community Check-ins
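
For anyone who wants the gist ahead of the talk, here is a minimal sketch of the pattern the abstract describes: open a Zarr store lazily with Xarray, slice it into patches with Xbatcher, and feed those to PyTorch. The store URL, variable name, and patch sizes below are hypothetical placeholders, not the actual benchmark configuration:

```python
import torch
import xarray as xr
import xbatcher

# Open an analysis-ready Zarr store lazily from object storage
# (hypothetical path and variable name).
ds = xr.open_zarr("s3://bucket/era5.zarr")

# Slice the dataset into fixed-size patches along time and space.
bgen = xbatcher.BatchGenerator(
    ds[["air_temperature"]],
    input_dims={"time": 32, "latitude": 64, "longitude": 64},
)

class ZarrDataset(torch.utils.data.Dataset):
    """Wrap an xbatcher generator as a map-style PyTorch dataset."""

    def __init__(self, bgen):
        self.bgen = bgen

    def __len__(self):
        return len(self.bgen)

    def __getitem__(self, idx):
        patch = self.bgen[idx]
        # .load() is where the actual reads from object storage happen.
        return torch.tensor(patch["air_temperature"].load().values)

loader = torch.utils.data.DataLoader(ZarrDataset(bgen), batch_size=4)
```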

There is also a blog post version of this talk here:

@rsignell - any guesses what COG performance would be like here? Or a kerchunked version?

Maybe ask Scott Henderson?

We benchmarked dataloading COGs from object storage that come in various sizes, from the KB range to tens of MB. Without some kind of sharding, IO is a big bottleneck for KB-range COGs; I think this was a network-request bottleneck. With sharding, by pre-batching COGs and storing the bytes in Parquet, IO performance doubled for the same dataset.
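
Roughly, the pre-batching step looked like the sketch below: pack the raw bytes of many small COGs into one Parquet file so a single GET fetches a whole shard. The paths and the one-binary-column schema are made up for illustration, not our exact pipeline:

```python
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

# Gather a batch of small COGs (hypothetical local staging directory).
cog_paths = sorted(pathlib.Path("cogs/").glob("*.tif"))

# Store each file's raw GeoTIFF bytes as one row of a binary column.
table = pa.table({
    "name": [p.name for p in cog_paths],
    "data": [p.read_bytes() for p in cog_paths],
})
pq.write_table(table, "cog_shard.parquet")

# At training time, one request retrieves the whole shard; individual
# COGs can then be decoded in memory, e.g. with rasterio's MemoryFile.
```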

I’m not sure whether COG versions of ERA5 exist, but the general principle is that it is harder to internally “rechunk” COGs, or to externally “rechunk” them by splitting them into smaller files, than it is with Zarr. So it is harder to tune dataloading performance starting from a COG dataset than from a Zarr store.
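
To make that comparison concrete: rewriting a Zarr store with different chunks is a short lazy pipeline (a sketch with hypothetical paths and chunk sizes; for very large stores you would reach for a tool like rechunker instead), whereas the COG equivalent means re-tiling and rewriting every file.

```python
import xarray as xr

ds = xr.open_zarr("s3://bucket/era5.zarr")  # hypothetical store

# Drop the on-disk chunk encoding so to_zarr doesn't try to reuse it.
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)

# Rechunk lazily and write a new store tuned for patch-wise dataloading.
(
    ds.chunk({"time": 1, "latitude": 256, "longitude": 256})
      .to_zarr("s3://bucket/era5-rechunked.zarr", mode="w")
)
```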

However, for ML dataloaders it’s really useful to maintain the projection CRS for each COG so that results can be georeferenced. I don’t see a way to maintain multiple CRSs in a single Zarr store while taking advantage of sharding and chunking. I hope to get agreement from the community that this is something rioxarray + the GeoZarr spec could handle. This comment is very relevant: Flexible coordinate transform by benbovy · Pull Request #9543 · pydata/xarray · GitHub
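
For context on where things stand today: rioxarray will carry a CRS through to Zarr, but as a single CF grid_mapping (spatial_ref) coordinate, i.e. one CRS per dataset, which is exactly the limitation above. A minimal sketch with hypothetical file and variable names:

```python
import rioxarray  # noqa: F401  (registers the "rasterio" engine and .rio accessor)
import xarray as xr

# Open one COG tile and attach (or overwrite) its CRS as a
# `spatial_ref` coordinate.
da = xr.open_dataarray("tile.tif", engine="rasterio")
da = da.rio.write_crs("EPSG:32633")

# The CRS survives the round trip to Zarr, but only one per dataset.
da.to_dataset(name="reflectance").to_zarr("tile.zarr", mode="w")
```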

Ryan, in your scenario, is each image the same size in terms of pixels, such that they can all stack into a regular array?


Why are these COGs so tiny? They surely don’t have internal chunking or overviews at that size? Is it a lot of missing data that compresses away?

Very interested to see your workflows (I know I’m a broken record here … but still). I see a lot of workflows that are way too far downstream in Python (or R) packages, when there are often much tighter options if you work closer to the API, in whatever language.