Cloud array storage solutions

Hey all, for some context: I’ve previously spent most of my time working with tabular data in PySpark, so I’m quite new to working with Pandas / NumPy and the geoscience space in general! :wave:

I’ve been reading lots of posts and looking around for what might be considered the state-of-the-art in terms of storing large ND array data in the cloud. The conclusion I’m starting to reach is that persisting arrays from within my own Python/Dask pipelines probably ought to be done with Zarr, but that the majority of data distribution still happens in NetCDF, GeoTIFF, HDF5, etc. Does that sound about right?

My second question is whether any of these tools provide some of the semantics that the more recent table formats (such as Delta, Hudi, Iceberg) provide? e.g. support for multiple concurrent writers, time travel, and incremental updates (writing lots of small updates and then running a compaction). I think these features would be useful to me, but maybe I’m still a bit stuck in my old ways of thinking about the world! I’m sure I’ll have a better understanding in a few weeks but wanted to ask around up front in case others have run into these issues.

Appreciate any responses - this forum has been invaluable to me so far :grinning:


Hi @smallart and welcome to the forum!

Yes, that is definitely the consensus best practice in our community. It is straightforward to create Zarr from anything Xarray can read, and the Pangeo Forge project is aiming to provide an even easier and more scalable way to produce cloud-based Zarr data from other file formats. Translating Zarr back to other formats can also be handled easily with Xarray, although I don’t know of a specific tool focused on that space.
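For illustration, the round trip looks roughly like this with Xarray (the file paths and chunk sizes below are just placeholders):

```python
import xarray as xr

# Read any format xarray understands (NetCDF here); chunks= gives lazy Dask arrays.
ds = xr.open_dataset("input.nc", chunks={"time": 100})

# Persist to Zarr; each Dask chunk becomes a Zarr chunk.
ds.to_zarr("output.zarr", mode="w")

# Going back the other way is just as easy.
xr.open_zarr("output.zarr").to_netcdf("roundtrip.nc")
```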

Zarr does support multiple concurrent writers, as long as the writers align themselves with chunk boundaries in a non-overlapping way (see the sketch below). But it sounds like you should be investigating TileDB: it is a great nd-array format and offers many of those features. Unfortunately it is under-used in Pangeo because we do not have the ability to write TileDB from Xarray. See Writeable xarray backend? · Issue #112 · TileDB-Inc/TileDB-CF-Py · GitHub for some context on that.
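Here is a rough sketch of that chunk-aligned pattern using Xarray’s region writes. The store path, dimension sizes, and chunking are all made up, and in practice each region write would run in a separate process or task:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temperature": (("time", "x"), np.random.rand(200, 10))},
    coords={"time": np.arange(200)},
)
store = "example.zarr"

# Step 1: one process lays out the store metadata without computing any data.
ds.chunk({"time": 100}).to_zarr(store, mode="w", compute=False)

# Step 2: each writer fills a region that lines up with the chunk boundaries,
# so no two writers ever touch the same Zarr chunk.
for start in (0, 100):
    region = {"time": slice(start, start + 100)}
    ds.isel(time=region["time"]).to_zarr(store, region=region)
```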

It’s been a while since the last reply, but I think it’s a very interesting topic.

Things have changed on the TileDB front. The company has published TileDB-CF-Py, which adds Xarray and NetCDF support. I’ve been playing around with it for a few days, and the experience has been good enough, even though it’s still under initial development.
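To give a flavour, this is roughly the workflow I’ve been testing. Take it with a grain of salt: the `from_netcdf` helper and the `"tiledb"` engine string are how I understand the package at the version I tried, so check the TileDB-CF-Py docs for the current entry points.

```python
import xarray as xr
import tiledb.cf

# Convert an existing NetCDF file into a TileDB group
# ("input.nc" and "output_group" are placeholder paths/URIs).
tiledb.cf.from_netcdf("input.nc", "output_group")

# TileDB-CF-Py registers an xarray backend, so reading it back looks like:
ds = xr.open_dataset("output_group", engine="tiledb")
```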

I’m trying to build a cloud-based geodata store that interfaces with the xarray data model. The current implementation uses Open Data Cube, and I’m not very happy with its performance, architecture, support, or cost (it requires a database, and maintaining one can get very expensive, especially for large, PB-scale datasets). However, its API is good and supports several querying methods, including passing geometries. I’m trying to replace it with either a TileDB instance plus a custom Python API, or something built on Zarr. I’m leaning towards TileDB because Zarr requires fsspec, whose performance is abysmal when it comes to S3. You can read my other thread here.
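For context, the Zarr path I’m evaluating goes through fsspec/s3fs, which is why its S3 performance matters so much to me. Roughly (the bucket name and storage options are placeholders):

```python
import xarray as xr

# Every object read here goes through fsspec's s3fs implementation,
# which is the layer whose performance I'm struggling with.
ds = xr.open_zarr(
    "s3://my-bucket/my-dataset.zarr",
    storage_options={"anon": False},
)
```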

I was wondering if anybody has any feedback on this approach.

I think we are getting to the bottom of @sotosoul’s performance issue with Zarr here: Extremly slow write to S3 bucket with xarray.Dataset.to_zarr - #31 by rabernat

The good news is that there is a fix already in the works in Xarray: Add initialize_zarr by dcherian · Pull Request #8460 · pydata/xarray · GitHub