Pangeo Showcase: "Icechunk: An Open-Source Transactional Storage Engine for Zarr"

Title: “Icechunk: An Open-Source Transactional Storage Engine for Zarr”
Invited Speaker: Ryan Abernathey (ORCID: 0000-0001-5999-4917)
When: Wednesday, October 23, 2024 at 4 PM EDT
Where: Launch Meeting - Zoom
Abstract:
Over the past year or two, the mainstream cloud data community has witnessed remarkable convergence across major data platforms around so-called “table formats” such as Iceberg, Hudi, and Delta Lake. These table formats organize many individual Parquet files into a single logical table, supporting database-style operations on top of vanilla cloud object storage (e.g. S3) and interoperability across multiple different analytical query engines (Spark, DuckDB, Snowflake, etc.).

But what about the exabytes of scientific data—from domains such as weather, climate, oceanography, astronomy, bioinformatics, materials science, deep learning, etc.—which originate in formats like HDF5 and can’t easily be mapped to the tabular data model? Scientists and scientific organizations working with multidimensional array (aka tensor) data are in urgent need of technologies to enable them to take advantage of cloud computing in order to enhance global collaboration and accelerate scientific progress.

Speaking as a climate scientist, it’s clear that our community is still struggling to figure out how best to take advantage of the cloud to leverage our massive volume of environmental data. Many of the questions revolve around file formats. Agency-led efforts such as the NASA Earthdata Cloud and the NOAA Open Data Dissemination (NODD) program have used a “lift and shift” approach to migrate petabytes of existing archival files (mostly NetCDF and GRIB) to cloud object storage, without much attention to supporting modern, cloud-native analytical workflows. The Zarr format has demonstrated promise as a more performant, scalable way of storing scientific data in the cloud, but the inertia behind the archival formats is strong.

Within this landscape, we are excited to announce the release of Icechunk, a new open-source library and specification for storing array data in cloud object storage that resolves many of the challenges facing the scientific community today. Inspired by the table formats described above, Icechunk functions as a universal storage engine for Zarr, supporting both “native” Zarr data and virtual Zarr-compatible data stored in archival files (building on the pioneering work of Kerchunk). Zarr describes a dataset as a hierarchical tree of multidimensional arrays of arbitrary size, each split into many “chunks”, plus user-defined attributes for tracking metadata. Icechunk introduces database-style transactions on top of Zarr, enabling safe, consistent updates to large datasets in an operational context. The core requirements behind Icechunk are as follows (a brief usage sketch appears after the list):

  • Object Storage Only - The full state of an Icechunk store is contained in cloud object storage, without the need for an external database.
  • Serializable isolation - Reads are isolated from concurrent writes and always use a committed snapshot. Writes are committed via a single atomic operation and are never partially visible. Readers do not acquire locks.
  • Chunk sharding and references - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF, GRIB) can be referenced.
  • Time travel - Previous snapshots of a store remain accessible after new ones have been written. Accessing an earlier snapshot is trivial and inexpensive.
  • Schema Evolution - Arrays and Groups can be added, renamed, and removed from the hierarchy with minimal overhead.
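
To make the transactional model concrete, below is a minimal sketch of what writing to an Icechunk store and reading back an earlier snapshot can look like from Python (assuming Zarr v3). The Icechunk API has evolved across releases, so treat the specific names here (local_filesystem_storage, Repository.create, writable_session, readonly_session, commit) and the placeholder path as illustrative rather than authoritative.

```python
import icechunk
import zarr

# Create a repository backed by object storage (a local path is used here
# purely for illustration; S3/GCS/Azure storage configs work the same way).
storage = icechunk.local_filesystem_storage("/tmp/icechunk-demo")
repo = icechunk.Repository.create(storage)

# All writes happen inside a session; nothing is visible to readers
# until the session is committed as a single atomic snapshot.
session = repo.writable_session("main")
root = zarr.group(store=session.store)
temp = root.create_array("temperature", shape=(365, 720, 1440),
                         chunks=(1, 720, 1440), dtype="f4")
temp[0] = 0.0
first_snapshot = session.commit("initial write of temperature array")

# A second transaction builds on top of the first snapshot.
session = repo.writable_session("main")
root = zarr.open_group(session.store)
root["temperature"][1] = 1.0
session.commit("write day 2")

# Time travel: earlier snapshots remain readable after new commits.
old = repo.readonly_session(snapshot_id=first_snapshot)
print(zarr.open_group(old.store, mode="r")["temperature"][0, 0, 0])
```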

This talk will present the design of the Icechunk specification, our Rust-based implementation, and its Python bindings. We will demonstrate integration with popular Python tools such as Xarray, Dask, and VirtualiZarr. We will also showcase real-world, end-to-end workflows for building Icechunk stores from both native and virtualized Zarr datasets.
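
As a flavor of the Xarray integration, the kind of workflow the talk will demonstrate looks roughly like the sketch below. This is a hedged example: the s3_storage configuration, the bucket and prefix names, and the to_zarr keywords (Icechunk stores use Zarr v3 with unconsolidated metadata) are assumptions and may differ from the released API.

```python
import icechunk
import xarray as xr

# Hypothetical bucket/prefix; any object-store config supported by Icechunk works.
storage = icechunk.s3_storage(bucket="my-bucket", prefix="climate/air-temp",
                              region="us-east-1", from_env=True)
repo = icechunk.Repository.create(storage)

# A small sample dataset from the Xarray tutorial collection.
ds = xr.tutorial.open_dataset("air_temperature")

# Write the dataset inside a transaction and commit it as one atomic snapshot.
session = repo.writable_session("main")
ds.to_zarr(session.store, mode="w", zarr_format=3, consolidated=False)
session.commit("ingest air_temperature tutorial dataset")
```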
Agenda:

  • ~15 minutes - Showcase presentation
  • 10-30 minutes - Discussion
  • 15-30 minutes - Community check-in

@rabernat or @jhamman: Is the recording of yesterday’s Earthmover talk on Icechunk generally available? I was thinking of watching it before today’s Pangeo Showcase talk so I can come with even better questions. :slight_smile:


Thanks for sharing it, @TomAugspurger.

I won’t be able to attend the showcase session today, but it would be nice if @rabernat included, in the “how does Icechunk compare to X” section, a comparison with more traditional array databases such as Rasdaman and SciDB, which parts of the remote sensing community use for data retrieval and permanent archival.

A new database added to the current zoo of DBMSs may be good, but it is nice to see how it advances the state of the art (as noted by Stonebraker and Pavlo earlier this year at SIGMOD), since DBMSs are often complex systems requiring a lot of effort in development and, once in production, in maintenance.

Hi @rlourenco - thanks for the question!

We’d welcome a documentation PR to add such a section to the FAQ. :wink: Sounds like you’ve already thought about this question, so you might be a good person to write the first draft.

Good question! Thanks Rich.
