I thought I'd ask here whether I'm missing out on even faster ways to load data.
From several zarr stores that each hold 5 data columns of equal length, I want to load 4 of them into dask worker memory (column length ~2e6, each column ~17 MB) and use them in embarrassingly parallel computations. The zarr store with data and metadata is saved to GCS, and I start a cluster of workers using the GC jupyter-hub deployment, in the same way illustrated in this notebook.
So far I load the data in ~1 s using either
`xr.open_zarr(mapper, consolidated=True, chunks='auto')` or
`zarr.open_consolidated(mapper)`, as shown further down in the notebook. In other words, very roughly 17 MB * 4 = 68 MB/s.
- Can I do better in the way I create the dataset and the zarr store?
- When I want to load these 4 columns, what would a more optimal, or conventional, loading function look like?

I'd be happy to try out any advice.
Edit: from reading answers in a similar-looking topic (but with a different dataset structure) I guess I could try putting the 4 columns into 1 variable instead of 4. But perhaps there exists a way around that?
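If it helps anyone comment: the "4 columns in 1 variable" idea could look like the sketch below, using xarray's `to_array` to stack the variables along a new dimension before writing, so a load becomes one read of one 2-D array rather than four separate reads (dimension and variable names here are my own, not from the topic I linked):

```python
import numpy as np
import xarray as xr

# Illustrative 4-column dataset, as in my setup (smaller row count here).
n = 100_000
ds = xr.Dataset({name: ("row", np.random.rand(n)) for name in "abcd"})

# Stack the four 1-D variables into a single 2-D variable of shape (4, n).
stacked = ds.to_array(dim="column")
stacked.name = "data"

# One would then write e.g. stacked.to_dataset().to_zarr(path, consolidated=True)
# and recover an individual column with stacked.sel(column="a").
```

The trade-off, as I understand it, is that selecting a single column back out then reads from the combined array's chunks, so the chunking along `column` would need to match the access pattern.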