Title: “Icechunk: An Open-Source Transactional Storage Engine for Zarr”
Invited Speaker: Ryan Abernathey (ORCID: 0000-0001-5999-4917)
When: Wednesday, October 23, 2024 at 4 PM EDT
Where: Launch Meeting - Zoom
Abstract:
Over the past year or two, the mainstream cloud data community has witnessed remarkable convergence across major data platforms around so-called “table formats” such as Iceberg, Hudi, and Delta Lake. These table formats organize many individual Parquet files into a single logical table, supporting database-style operations on top of vanilla cloud object storage (e.g. S3) and interoperability across multiple analytical query engines (Spark, DuckDB, Snowflake, etc.).
But what about the exabytes of scientific data—from domains such as weather, climate, oceanography, astronomy, bioinformatics, materials science, deep learning, etc.—which originate in formats like HDF5 and can’t easily be mapped to the tabular data model? Scientists and scientific organizations working with multidimensional array (aka tensor) data are in urgent need of technologies to enable them to take advantage of cloud computing in order to enhance global collaboration and accelerate scientific progress.
Speaking as a climate scientist, it’s clear that our community is still struggling to figure out how best to take advantage of the cloud to leverage our massive volume of environmental data. Many of the questions revolve around file formats. Agency-led efforts such as the NASA Earthdata Cloud and the NOAA Open Data Dissemination (NODD) program have used a “lift and shift” approach to migrate petabytes of existing archival files (mostly NetCDF and GRIB) to cloud object storage, without much attention to supporting modern, cloud-native analytical workflows. The Zarr format has demonstrated promise as a more performant, scalable way of storing scientific data in the cloud, but the inertia behind the archival formats is strong.
Within this landscape, we are excited to announce the release of Icechunk, a new open-source library and specification for storing array data in cloud object storage that resolves many of the challenges facing the scientific community today. Inspired by the table formats described above, Icechunk functions as a universal storage engine for Zarr, supporting both “native” Zarr data and virtual Zarr-compatible data stored in archival files (building on the pioneering work of Kerchunk). Zarr describes a dataset as a hierarchical tree of multidimensional arrays of arbitrary size, each split into many “chunks”, plus user-defined attributes for tracking metadata. Icechunk introduces database-style transactions on top of Zarr, enabling safe, consistent updates to large datasets in an operational context. The core requirements behind Icechunk are as follows (a short illustrative sketch of the resulting workflow appears after the list):
- Object storage only - The full state of an Icechunk store is contained in cloud object storage, without the need for an external database.
- Serializable isolation - Reads are isolated from concurrent writes and always use a committed snapshot. Writes are committed via a single atomic operation and are never partially visible. Readers do not acquire locks.
- Chunk sharding and references - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF, GRIB) can be referenced.
- Time travel - Previous snapshots of a store remain accessible after new ones have been written. Accessing an earlier snapshot is trivial and inexpensive.
- Schema evolution - Arrays and groups can be added, renamed, and removed from the hierarchy with minimal overhead.
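To make the transactional model concrete, here is a minimal Python sketch of a commit-then-time-travel workflow. The `Repository`/session names and the `s3_storage` helper are illustrative assumptions based on the description above, not a verbatim copy of the released Python bindings; only the Zarr calls follow the standard zarr-python API.

```python
# Illustrative sketch only: the Repository/session names below are assumptions,
# not necessarily the exact released Icechunk API.
import icechunk
import zarr

# The full repository state lives in object storage (e.g. S3); no external
# database is required.
storage = icechunk.s3_storage(bucket="my-bucket", prefix="climate/sst")  # assumed helper
repo = icechunk.Repository.create(storage)                               # assumed API

# Writes happen inside a session and become visible only on commit,
# giving serializable isolation: readers never see a partial update.
session = repo.writable_session("main")                                  # assumed API
root = zarr.open_group(store=session.store, mode="w")
sst = root.create_array("sst", shape=(365, 180, 360), chunks=(1, 180, 360), dtype="f4")
sst[0] = 21.5
snapshot_id = session.commit("initial load of 2024 SST")                 # atomic commit

# Time travel: earlier snapshots stay readable and are cheap to access.
old = repo.readonly_session(snapshot_id=snapshot_id)                     # assumed API
print(zarr.open_group(store=old.store, mode="r")["sst"][0, 0, 0])
```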
This talk will present the design of the Icechunk specification, our Rust-based implementation, and its Python bindings. We will demonstrate the integration with popular Python tools such as Xarray, Dask, and VirtualiZarr. We will also showcase real-world end-to-end workflows building Icechunk stores for both native and virtualized Zarr datasets.
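As a rough illustration of the Xarray integration mentioned above (reusing the hypothetical `repo` object from the previous sketch; the session calls remain assumptions, while `to_zarr` and `open_zarr` are Xarray's standard Zarr entry points):

```python
import numpy as np
import xarray as xr

# Assumes the hypothetical 'repo' object from the sketch above. A new writable
# session is opened, an Xarray dataset is written as Zarr, and the change is
# committed as a single atomic snapshot.
session = repo.writable_session("main")                  # assumed API
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), np.random.rand(10, 180, 360).astype("f4"))}
)
ds.to_zarr(session.store, mode="w", consolidated=False)  # standard Xarray Zarr writer
session.commit("write SST with Xarray")                  # assumed API

# Read it back lazily; variables come back as chunked (Dask-backed) arrays,
# so downstream computation can run in parallel as usual.
ds2 = xr.open_zarr(repo.readonly_session("main").store, consolidated=False)  # assumed API
print(ds2["sst"].mean().compute())
```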
Agenda:
- ~15 minutes - Showcase presentation
- 10-30 minutes - Discussion
- 15-30 minutes - Community check-in