Cloud array storage solutions

Hey all, for some context I’ve spent most my time previously working with tabular data in PySpark - so I’m quite new to working with Pandas / Numpy and the geoscience space in general! :wave:

I’ve been reading lots of posts and looking around for what might be considered the state-of-the-art in terms of storing large ND array data in the cloud. The conclusion I’m starting to reach is that persisting arrays from within my own Python/Dask pipelines probably ought to be done with Zarr, but that the majority of data distribution still happens in NetCDF, GeoTIFF, HDF5, etc. Does that sound about right?

My second question is whether any of these tools provide some of the semantics that the more recent table formats (such as Delta, Hudi, Iceberg) provide? e.g. support for multiple concurrent writers, time travel, and incremental updates (writing lots of small updates and then running a compaction). I think these features would be useful to me, but maybe I’m still a bit stuck in my old ways of thinking about the world! I’m sure I’ll have a better understanding in a few weeks but wanted to ask around up front in case others have run into these issues.

Appreciate any responses - this forum has been invaluable to me so far :grinning:

1 Like

Hi @smallart and welcome to the forum!

This is definitely the consensus best practice in our community. It is straightforward to create Zarr from anything Xarray can read. The Pangeo Forge project is aiming to provide an even easier and more scalable way to produce cloud-based Zarr data from other file formats. The problem of translating Zarr back to other formats can be handled easily with Xarray. But I don’t know of a specific tool focused on that space.

Zarr does support multiple concurrent writers, as long as the writers align themselves with chunk boundaries in a non-overlapping way. But it sounds like you should be investigating TileDB. It offers many of those features. TileDB is a great nd-array format. Unfortunately it is under-used in Pangeo because we do not have the ability to write TileDB from Xarray. See Writeable xarray backend? · Issue #112 · TileDB-Inc/TileDB-CF-Py · GitHub for some context on that.