Pangeo Showcase: "VirtualiZarr: Create virtual Zarr stores using xarray syntax"

Title: “VirtualiZarr: Create virtual Zarr stores using xarray syntax”
Invited Speaker: Tom Nicholas (ORCID: 0000-0002-2176-0530)
When: Wednesday, May 15, 12PM EDT
Where: Launch Meeting - Zoom
Abstract:
The Kerchunk idea solves an incredibly important problem: accessing big archival datasets via a cloud-optimized pattern, but without copying or modifying the original data in any way. This is a win-win-win for users, data engineers, and data providers. Users see fast-opening Zarr-compliant stores that perform well with libraries like xarray and dask; data engineers can provide this speed by adding a lightweight virtualization layer on top of existing data (without having to ask anyone’s permission); and data providers don’t have to change anything about their legacy files for them to be used in a cloud-optimized way.

However, kerchunk’s current design (especially MultiZarrToZarr) is limited:

  • Store-level abstractions make combining datasets complicated and idiosyncratic, and require duplicating logic that already exists in libraries like xarray,
  • The kerchunk format for storing on-disk references requires the caller to understand it, usually via fsspec (which is currently only implemented in Python).
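
To make the second point concrete, here is a minimal sketch of what a reference set in the spirit of kerchunk looks like and what a reader has to do with it. The key names, the stand-in file, and the `read_chunk` helper are all hypothetical and are not taken from kerchunk or fsspec; the only assumption about the format is that a key maps either to an inline string or to a `[path, offset, length]` triple.

```python
import os
import tempfile


def read_chunk(refs, key):
    """Resolve a store key to bytes.

    Inline string values are returned directly; [path, offset, length]
    entries trigger a byte-range read from the original file.
    (Hypothetical helper, not part of kerchunk or fsspec.)
    """
    entry = refs[key]
    if isinstance(entry, str):
        return entry.encode()
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an untouched archival file: a 6-byte header, then two chunks.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"HEADERchunk0chunk1")
    legacy_path = tmp.name

# The "virtual store": metadata is inline, chunk data is only referenced.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "temp/0": [legacy_path, 6, 6],
    "temp/1": [legacy_path, 12, 6],
}

first = read_chunk(refs, "temp/0")
second = read_chunk(refs, "temp/1")
os.remove(legacy_path)
```

Any reader that does not implement this lookup step cannot use the references, which is the limitation described above.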

VirtualiZarr aims to build on the excellent ideas of kerchunk whilst solving the above problems:

  • Using array-level abstractions instead is more modular, easier to reason about, allows convenient wrapping by high-level tools like xarray, and is simpler to parallelize,
  • Writing the virtualized arrays out as a valid Zarr store directly (through new Zarr Extensions) will allow Zarr implementations in any language to read the archival data.
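
The array-level idea in the first point can be illustrated with a small sketch. This is not the VirtualiZarr API: `ChunkRef`, `VirtualArray`, and `concat` are hypothetical names, the arrays are one-dimensional, and every chunk is assumed to be full-sized. The point is only that combining datasets becomes a re-indexing of chunk references per array, with no data read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Location of one chunk's bytes inside an existing file."""
    path: str
    offset: int
    length: int


@dataclass
class VirtualArray:
    """Hypothetical 1-D virtual array: chunk index -> ChunkRef, no data loaded."""
    chunk_size: int  # number of elements per chunk
    chunks: dict     # int -> ChunkRef

    @property
    def shape(self):
        # Assumes every chunk is full-sized.
        return (self.chunk_size * len(self.chunks),)


def concat(arrays):
    """Concatenate along the single axis by renumbering chunk references."""
    sizes = {a.chunk_size for a in arrays}
    if len(sizes) != 1:
        raise ValueError("all arrays must share the same chunk size")
    merged = {}
    next_index = 0
    for a in arrays:
        for i in sorted(a.chunks):
            merged[next_index] = a.chunks[i]
            next_index += 1
    return VirtualArray(chunk_size=sizes.pop(), chunks=merged)


# Two made-up files, each contributing chunks of 10 elements.
day1 = VirtualArray(10, {0: ChunkRef("day1.nc", 100, 80),
                         1: ChunkRef("day1.nc", 180, 80)})
day2 = VirtualArray(10, {0: ChunkRef("day2.nc", 100, 80)})

combined = concat([day1, day2])
```

Because each array carries its own references, a tool like xarray could in principle apply its existing combine logic on top of objects like these, and each array can be processed independently.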

We will talk about the motivation for, current status of, and future plans for the VirtualiZarr package as a means of gaining the power of kerchunk in a fully Zarr-native way.

  • 20 minutes - Community Showcase
  • 40 minutes - Showcase discussion/Community check-ins