Pangeo Showcase: "VirtualiZarr: Create virtual Zarr stores using xarray syntax"

rsignell · March 27, 2024, 5:25pm

Title: “VirtualiZarr: Create virtual Zarr stores using xarray syntax”
Invited Speaker: Tom Nicholas (ORCID: 0000-0002-2176-0530)
When: Wednesday, May 15, 12PM EDT
Where: Launch Meeting - Zoom
Abstract:
The Kerchunk idea solves an incredibly important problem: accessing big archival datasets via a cloud-optimized pattern, but without copying or modifying the original data in any way. This is a win-win-win for users, data engineers, and data providers. Users see fast-opening zarr-compliant stores that work performantly with libraries like xarray and dask, data engineers can provide this speed by adding a lightweight virtualization layer on top of existing data (without having to ask anyone’s permission), and data providers don’t have to change anything about their legacy files for them to be used in a cloud-optimized way.

However, kerchunk’s current design (especially MultiZarrToZarr) is limited:

Store-level abstractions make combining datasets complicated and idiosyncratic, and require duplicating logic that already exists in libraries like xarray,
The kerchunk format for storing on-disk references requires the caller to understand it, usually via fsspec (which is currently only implemented in Python).
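
To make the second point concrete, here is a minimal sketch of what "understanding the reference format" involves. It assumes a simplified kerchunk-style reference set in which each chunk key maps to a `[path, offset, length]` triple. The stand-in file, the `temp` variable name, and the `read_chunk` helper are all hypothetical. Real reference sets also carry Zarr metadata keys and are normally read through fsspec rather than by hand.

```python
import os
import tempfile

# Stand-in for a legacy archival file: an 8-byte header followed by two 10-byte chunks.
payload = b"HEADER--" + b"chunk-zero" + b"chunk-one!"
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(payload)
    legacy_path = f.name

# Simplified kerchunk-style references (assumption: metadata keys omitted).
# Each chunk key points at a byte range inside the untouched original file.
refs = {
    "version": 1,
    "refs": {
        "temp/0": [legacy_path, 8, 10],
        "temp/1": [legacy_path, 18, 10],
    },
}


def read_chunk(reference_set, key):
    """Hypothetical helper: fetch one chunk's bytes by following its reference."""
    path, offset, length = reference_set["refs"][key]
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length)


chunk0 = read_chunk(refs, "temp/0")
chunk1 = read_chunk(refs, "temp/1")
os.remove(legacy_path)  # clean up the stand-in file
```

Any reader that does not implement this lookup cannot use the references, which is the portability limitation described above.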

VirtualiZarr aims to build on the excellent ideas of kerchunk whilst solving the above problems:

Using array-level abstractions instead is more modular, easier to reason about, allows convenient wrapping by high-level tools like xarray, and is simpler to parallelize,
Writing the virtualized arrays out as a valid Zarr store directly (through new Zarr Extensions) will allow for Zarr implementations in any language to read the archival data.
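
The array-level idea can be sketched in a few lines. This is an illustration only, not VirtualiZarr's actual classes or API: the `VirtualArray` dataclass, its `manifest` field, and the `concat` function are hypothetical names chosen for this example. Each array holds a mapping from chunk index to a byte range in an existing file, so concatenation only has to shift chunk indices. No data is read or copied.

```python
from dataclasses import dataclass

ChunkRef = tuple[str, int, int]  # (path, byte offset, byte length)


@dataclass(frozen=True)
class VirtualArray:
    """Hypothetical stand-in for one array whose chunks live in existing files."""

    chunk_grid: tuple[int, ...]                # number of chunks along each axis
    manifest: dict[tuple[int, ...], ChunkRef]  # chunk index -> byte range


def concat(arrays, axis=0):
    """Concatenate virtual arrays along `axis` by shifting chunk indices."""
    first = arrays[0]
    other_axes = first.chunk_grid[:axis] + first.chunk_grid[axis + 1:]
    merged = {}
    offset = 0  # chunks already placed along the concatenation axis
    for arr in arrays:
        if arr.chunk_grid[:axis] + arr.chunk_grid[axis + 1:] != other_axes:
            raise ValueError("chunk grids must match on all non-concatenated axes")
        for idx, ref in arr.manifest.items():
            shifted = idx[:axis] + (idx[axis] + offset,) + idx[axis + 1:]
            merged[shifted] = ref
        offset += arr.chunk_grid[axis]
    grid = first.chunk_grid[:axis] + (offset,) + first.chunk_grid[axis + 1:]
    return VirtualArray(grid, merged)


# Two single-timestep arrays, each backed by a different (made-up) file name.
day1 = VirtualArray((1, 2), {(0, 0): ("day1.nc", 100, 50), (0, 1): ("day1.nc", 150, 50)})
day2 = VirtualArray((1, 2), {(0, 0): ("day2.nc", 100, 50), (0, 1): ("day2.nc", 150, 50)})
combined = concat([day1, day2], axis=0)
```

Because each array carries its own manifest, a higher-level tool such as xarray can wrap objects like these and reuse its existing combining logic, and arrays for different variables can be processed independently.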

We will talk about the motivation for, current status of, and future plans for the VirtualiZarr package as a means of gaining the power of kerchunk in a fully Zarr-native way.

20 minutes - Community Showcase
40 minutes - Showcase discussion/Community check-ins

TomNicholas · May 15, 2024, 6:01pm

Talk slides are here
