Best practice to store and load data-columns of equal-length from GCS (data not on a regular grid)


I thought I would ask here whether I am missing out on even faster ways to load data.
From several zarr stores that each hold 5 data columns of equal length, I want to load 4 of them into dask worker memory (column length ~2e6, each column ~17 MB) and use them in embarrassingly parallel computations. The zarr store with data and metadata is saved to GCS, and I start a cluster of workers using the GC JupyterHub deployment, in the same way as illustrated in this notebook.

So far I load the data in ~1 sec using either xr.open_zarr(mapper, consolidated=True, chunks='auto') or zarr.open_consolidated(mapper), as shown further down in the notebook. In other words, very approximately 17 MB * 4 = 68 MB/sec.

  • Can I do better in the way I create the dataset and the zarr store?
  • When I want to load these 4 columns, what would a more optimal, or conventional, loading-function look like?

Would be happy to try out any advice.

Edit: From reading answers in a similar-looking topic (but with a different dataset structure), I guess I could try putting the 4 columns in 1 variable instead of 4. But perhaps there exists a way around that?

Best, Ola

Update: I now client.scatter the data to the workers beforehand instead.
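Roughly like this: load once on the client, scatter with broadcast=True so every worker holds a copy, and let tasks receive the scattered data as a future. The in-process Client and the stand-in array are just for illustration; in practice it is the GC JupyterHub cluster and the columns loaded from the zarr store.

```python
import numpy as np
from dask.distributed import Client

# In-process cluster for illustration; in practice the GC JupyterHub cluster.
client = Client(processes=False)

# Stand-in for a column loaded from the zarr store.
column = np.random.rand(1_000)

# Copy the data to every worker once, instead of re-reading GCS per task.
future = client.scatter(column, broadcast=True)

# Tasks take the future as an argument; dask resolves it to the array.
result = client.submit(np.mean, future).result()

client.close()
```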