Best practice to store and load data-columns of equal-length from GCS (data not on a regular grid)


I thought I would ask here whether I am missing out on even faster ways to load data.
From several zarr stores that each hold 5 data columns of equal length, I want to load 4 of them into dask worker memory (column length ~2e6, each column ~17 MB) and use them in embarrassingly parallel computations. The zarr store with data and metadata is saved to GCS, and I start a cluster of workers using the GC JupyterHub deployment, in the same way as illustrated in this notebook.

So far I load the data in ~1 sec using either xr.open_zarr(mapper, consolidated=True, chunks='auto') or zarr.open_consolidated(mapper), as shown further down in the notebook. In other words, very approximately 17 MB * 4 = 68 MB/sec.

  • Can I do better in the way I create the dataset and the zarr store?
  • When I want to load these 4 columns, what would a more optimal, or conventional, loading-function look like?

Would be happy to try out any advice.

Edit: From reading answers in a similar-looking topic (but with a different dataset structure), I guess I could try putting the 4 columns in 1 variable instead of 4. But perhaps there exists a way around that?

Best, Ola

Update: I now client.scatter the data to the workers beforehand instead.
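Roughly like this: load once on the client, scatter with broadcast=True so every worker holds a copy, and let tasks receive the scattered data as a future. The in-process Client and the stand-in array are just for illustration; in practice it is the GC JupyterHub cluster and the columns loaded from the zarr store.

```python
import numpy as np
from dask.distributed import Client

# In-process cluster for illustration; in practice the GC JupyterHub cluster.
client = Client(processes=False)

# Stand-in for a column loaded from the zarr store.
column = np.random.rand(1_000)

# Copy the data to every worker once, instead of re-reading GCS per task.
future = client.scatter(column, broadcast=True)

# Tasks take the future as an argument; dask resolves it to the array.
result = client.submit(np.mean, future).result()

client.close()
```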