Stream Zarr data from HPC to local machine

Hi folks,

I have generated a Zarr store (~10 TB) on my university’s HPC system, and I would like to do some analysis on the dataset. My university has a JupyterHub set up, but it doesn’t always play nicely with Dask-style computations. I have been able to come up with some simple-enough workarounds while on the HPC system, but I was wondering: is there a way to stream the data from my Zarr store to my local computer? I have seen a bunch of demos where people stream from AWS, but I haven’t found any regarding HPC systems.

You may find xpublish (GitHub - xarray-contrib/xpublish: Publish Xarray Datasets via a REST API) useful for this kind of use case.

Tutorial: Xpublish 0.1.0.post14 documentation
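
Very roughly, the pattern looks something like this (just a sketch, assuming your store opens cleanly with xarray; the path, hostname, and port below are placeholders):

    # On the HPC side: serve the dataset over HTTP with xpublish
    import xarray as xr
    import xpublish  # importing xpublish registers the .rest accessor on Dataset

    ds = xr.open_zarr("/path/to/store.zarr", consolidated=True)
    ds.rest.serve(host="0.0.0.0", port=9000)  # placeholder port; blocks while serving

    # On your local machine: read the served Zarr keys via fsspec's HTTP filesystem
    # import fsspec
    # mapper = fsspec.get_mapper("http://<hpc-hostname>:9000")
    # ds = xr.open_zarr(mapper, consolidated=True)

You’ll also need network access from your laptop to whatever node runs the server (e.g. via an SSH tunnel).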


My previous reply presumes that this is a Zarr store that can be read with xarray. If you are working with a raw Zarr store, I am not sure xpublish can handle that use case.

This is amazing, thank you Anderson! I am indeed using xarray to handle this analysis. I’ll check this out and follow up if I run into any major roadblocks.

Some more primitive/experimental solutions that might work for you:

  • the jupyter filesystem of fsspec, which allows your local process to see whatever your remote Jupyter kernel sees (whole files at a time), assuming the remote kernel has all the files on its local file system. You’ll need your Jupyter token (see the sketch after this list).
  • the dask filesystem: if you run a LocalCluster on the HPC system, you can view any file system that a worker can see.
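
For the first option, a minimal sketch (I’m assuming the jupyter filesystem takes the server URL and your token; the URL, token, and path below are placeholders):

    # Browse the remote Jupyter server's files from your local machine via fsspec,
    # then open the Zarr store with xarray. Whole files (chunks) are fetched at a time.
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "jupyter",
        url="https://jupyterhub.example.edu/user/you",  # placeholder server URL
        tok="<your-jupyter-token>",
    )
    mapper = fs.get_mapper("path/to/store.zarr")  # path as the remote kernel sees it
    ds = xr.open_zarr(mapper, consolidated=True)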

You may also want to run an Intake server, which can transmit Zarr data natively, but it requires you to write a catalog describing your datasets.
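
If you try the Intake route, the shape of it is roughly this (a sketch only; I’m assuming the zarr driver from the intake-xarray plugin, the default server port, and placeholder hostnames/paths):

    # On the HPC side, describe the dataset in a catalog file (catalog.yml):
    #
    #   sources:
    #     my_dataset:
    #       driver: zarr              # provided by the intake-xarray plugin
    #       args:
    #         urlpath: /path/to/store.zarr
    #
    # and serve it with:  intake-server catalog.yml
    #
    # On your local machine, connect to that server and open the dataset lazily:
    import intake

    cat = intake.open_catalog("intake://hpc-node.example.edu:5000")  # placeholder host
    ds = cat.my_dataset.to_dask()  # xarray.Dataset backed by dask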


The key question here is what kind of network connectivity exists between your HPC system and your local machine. The reason we have so many examples loading data from AWS S3 (and similar cloud object stores) is that these storage systems have extremely high bandwidth to their colocated cloud computing regions. If your HPC system is sitting behind a standard network connection, that will limit how fast you can get data out. HPC systems are also generally high-security, so you may not be able to reach the HPC storage from outside the cluster at all. Furthermore, even if you can resolve these problems, the HPC system’s disk may not be able to deliver high read throughput. You should talk to your HPC system administrator to find out more.

In general, if you are working with HPC, we strongly recommend using the HPC itself for your Pangeo workloads. There are lots of people doing this. There are many good solutions for deploying Dask on HPC (Dask-Jobqueue, Dask-MPI). Your time is probably better spent figuring out how to get these working than trying to get the data out of your HPC system.
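
For example, with Dask-Jobqueue the basic pattern looks roughly like this (the queue name and resource numbers are placeholders for whatever your SLURM site uses):

    # Launch dask workers as SLURM batch jobs and connect a client to them
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(
        queue="regular",      # placeholder partition name
        cores=24,             # cores per worker job
        memory="100GB",       # memory per worker job
        walltime="01:00:00",
    )
    cluster.scale(jobs=4)     # ask SLURM for four worker jobs
    client = Client(cluster)

    # From here, xr.open_zarr(...) and subsequent computations run on the workers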


Thanks @martindurant and @rabernat ! I’ve been busy converting a bunch of netCDF data to Zarr this week, but once I’m done, I’m going to try some of these approaches.

This is good to know about. While the university-maintained/“public” JupyterHub can be a pain to work with, my analysis code works great if I launch a SLURM batch job and tunnel into that node. However, this approach means that I need to burn my computational budget and, more frustratingly, potentially wait hours for the batch job to start running. I don’t know what the throughput between the HPC system and my personal computer will look like (or whether I will run into security issues), but I’m curious to see. In the end, it might not work out, and I will return to grappling with analysis on the HPC system itself :slight_smile:

You could use Jupyter-forward to handle the port forwarding and tunneling into compute nodes, so that you get a Jupyter environment on your local machine! I’ve found this helpful when I don’t want to deal with the main JupyterHub directly and want to install custom environments.


Thanks so much for the help and the informative thread. "In general, if you are working with HPC, we strongly recommend using the HPC itself for your Pangeo workloads."

Is this in reference to analysis, or does it also apply to data transfer? For instance, let’s say I want to use the HPC system to analyze a Zarr store kept in a public Google Cloud bucket. (This is a different use case from the original question in the thread.)

Would the general recommendation be to transfer the Zarr data from Google Cloud to the HPC system and then analyze it there (because the HPC system’s connectivity to Google Cloud Storage is likely worse than Google Cloud compute’s connectivity to Google Cloud Storage)?
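
For concreteness, I mean something like this, run from an HPC node (the bucket path is just a placeholder):

    # Lazily open a Zarr store on a public GCS bucket with anonymous access
    import fsspec
    import xarray as xr

    mapper = fsspec.get_mapper("gs://some-public-bucket/dataset.zarr", token="anon")
    ds = xr.open_zarr(mapper, consolidated=True)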