Synchronizer for Zarr + Dask on Kubernetes

Hi @Leonard_Strnad, I believe you now work with my awesome former colleague Martha. Welcome to the Discourse! There are two higher-level questions here that might be worth considering before you commit to building a large archive. The first is what the ideal representation model for the underlying data is. Recently @TomAugspurger has begun thinking about alternative, efficient representations for sparse EO data (see https://discourse.pangeo.io/t/tables-x-arrays-and-rasters/1945). His thread is definitely worth reviewing, as there is still significant flux in the community around how to approach sparse datasets.

The second question concerns the underlying data storage mechanism. If your collection of TIFFs already lives in GCS, STAC can provide a mechanism to reference and access the underlying bytes without the need to ingest them into a Zarr archive. As you noted, there is ongoing work on stackstac (gjoseph92/stackstac on GitHub), which turns a STAC catalog into a dask-based xarray, to improve the interoperability between STAC and xarray.
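To make the "reference, don't ingest" idea a bit more concrete, here is a minimal sketch of what a STAC Item for one of your TIFFs might look like, written as plain Python dicts. The bucket name, file names, item IDs, bounding box, and the `make_item` / `asset_hrefs` helpers are all hypothetical and only for illustration; a real catalog would typically carry more metadata (projection, bands, and so on).

```python
from datetime import datetime, timezone


def make_item(item_id, href, when, bbox):
    """Build a minimal STAC-Item-like dict that points at a TIFF by URL.

    Hypothetical helper: the item stores only metadata plus a reference
    (href) to the file, never the pixel data itself.
    """
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": [west, south, east, north],
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {"datetime": when.isoformat()},
        "assets": {
            "data": {
                "href": href,
                "type": "image/tiff; application=geotiff",
                "roles": ["data"],
            }
        },
        "links": [],
    }


def asset_hrefs(items, key="data"):
    """Collect the storage locations the items reference."""
    return [item["assets"][key]["href"] for item in items]


# Hypothetical bucket and scenes, assumed for this example only.
bbox = (-105.0, 39.0, -104.0, 40.0)
items = [
    make_item("scene-001", "gs://example-bucket/tiffs/scene-001.tif",
              datetime(2021, 6, 1, tzinfo=timezone.utc), bbox),
    make_item("scene-002", "gs://example-bucket/tiffs/scene-002.tif",
              datetime(2021, 6, 9, tzinfo=timezone.utc), bbox),
]

hrefs = asset_hrefs(items)
```

The point is that the catalog is just small metadata records with pointers into GCS, so the TIFFs stay where they are. As far as I know, a list of items like this (with the extra projection metadata it expects) is the kind of input `stackstac.stack` is designed to consume, but check its docs for the exact requirements.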

The community has not yet embraced a single solution for this question. Decision factors include the spatial and temporal distribution of your data, the storage access patterns of your use case, and the mutability of your data collection over time. Lots of teams are grappling with these questions at the moment, so it would be great to keep an open dialogue going in this thread (or another if that makes more sense) :]
