Stream Zarr data from HPC to local machine

Hi folks,

I have generated a Zarr store (~10 TB) on my university’s HPC system, and I would like to do some analysis on the dataset. My university has a JupyterHub set up, but it doesn’t always play nicely with Dask-style computations. I have been able to come up with some simple-enough workarounds while on the HPC system, but I was wondering: is there a way to stream the data from my Zarr store to my local computer? I have seen a bunch of demos where people stream from AWS, but I haven’t found any regarding HPC systems.

You may find xpublish (GitHub - xarray-contrib/xpublish: Publish Xarray Datasets via a REST API) useful for this kind of use case.

Tutorial: Xpublish 0.1.0.post14 documentation
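
Very roughly, the pattern looks something like this (just a sketch, assuming your store opens cleanly with xarray; the path, hostname, and port below are placeholders):

    # On the HPC side: serve the dataset over HTTP with xpublish
    import xarray as xr
    import xpublish  # importing xpublish registers the .rest accessor on Dataset

    ds = xr.open_zarr("/path/to/store.zarr", consolidated=True)
    ds.rest.serve(host="0.0.0.0", port=9000)  # placeholder port; blocks while serving

    # On your local machine: read the served Zarr keys via fsspec's HTTP filesystem
    # import fsspec
    # mapper = fsspec.get_mapper("http://<hpc-hostname>:9000")
    # ds = xr.open_zarr(mapper, consolidated=True)

You’ll also need network access from your laptop to whatever node runs the server (e.g. via an SSH tunnel).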


My previous reply presumes that this is a Zarr store that can be read with xarray. If you are working with a raw Zarr store, I am not sure xpublish can handle that use case.

This is amazing, thank you Anderson! I am indeed using xarray to handle this analysis. I’ll check this out and follow up if I run into any major roadblocks.

Some more primitive/experimental solutions that might work for you:

  • the jupyter filesystem of fsspec, which allows your local process to see whatever your remote Jupyter kernel sees (whole files at a time), assuming the remote kernel has all the files on its local file system. You’ll need your Jupyter token (see the sketch after this list).
  • the dask filesystem: if you run a LocalCluster on the HPC system, you can view any file system that a worker can see.
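
For the first option, a minimal sketch (I’m assuming the jupyter filesystem takes the server URL and your token; the URL, token, and path below are placeholders):

    # Browse the remote Jupyter server's files from your local machine via fsspec,
    # then open the Zarr store with xarray. Whole files (chunks) are fetched at a time.
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "jupyter",
        url="https://jupyterhub.example.edu/user/you",  # placeholder server URL
        tok="<your-jupyter-token>",
    )
    mapper = fs.get_mapper("path/to/store.zarr")  # path as the remote kernel sees it
    ds = xr.open_zarr(mapper, consolidated=True)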

You may also want to run an Intake server, which can transmit Zarr data natively, but it requires you to write a catalog describing your datasets.
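
If you try the Intake route, the shape of it is roughly this (a sketch only; I’m assuming the zarr driver from the intake-xarray plugin, the default server port, and placeholder hostnames/paths):

    # On the HPC side, describe the dataset in a catalog file (catalog.yml):
    #
    #   sources:
    #     my_dataset:
    #       driver: zarr              # provided by the intake-xarray plugin
    #       args:
    #         urlpath: /path/to/store.zarr
    #
    # and serve it with:  intake-server catalog.yml
    #
    # On your local machine, connect to that server and open the dataset lazily:
    import intake

    cat = intake.open_catalog("intake://hpc-node.example.edu:5000")  # placeholder host
    ds = cat.my_dataset.to_dask()  # xarray.Dataset backed by dask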


The key question here is what kind of network connectivity exists between your HPC system and your local machine. The reason we have so many examples loading data from AWS S3 (and similar cloud object stores) is that these storage systems have extremely high bandwidth to their colocated cloud computing regions. If your HPC system is sitting behind a standard network connection, that will limit how fast you can get data out. HPC systems are also generally high-security, so you may not be able to reach the HPC storage from outside the cluster at all. Furthermore, even if you can resolve these problems, the HPC system’s disk may not be able to deliver high read throughput. You should talk to your HPC system administrator to find out more.

In general, if you are working with HPC, we strongly recommend using the HPC itself for your Pangeo workloads. There are lots of people doing this. There are many good solutions for deploying Dask on HPC (Dask-Jobqueue, Dask-MPI). Your time is probably better spent figuring out how to get these working than trying to get the data out of your HPC system.
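
For example, with Dask-Jobqueue the basic pattern looks roughly like this (the queue name and resource numbers are placeholders for whatever your SLURM site uses):

    # Launch dask workers as SLURM batch jobs and connect a client to them
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(
        queue="regular",      # placeholder partition name
        cores=24,             # cores per worker job
        memory="100GB",       # memory per worker job
        walltime="01:00:00",
    )
    cluster.scale(jobs=4)     # ask SLURM for four worker jobs
    client = Client(cluster)

    # From here, xr.open_zarr(...) and subsequent computations run on the workers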


Thanks @martindurant and @rabernat ! I’ve been busy converting a bunch of netCDF data to Zarr this week, but once I’m done, I’m going to try some of these approaches.

This is good to know about. While the university-maintained/“public” JupyterHub can be a pain to work with, my analysis code works great if I launch a SLURM batch job and tunnel into that node. However, this approach means that I need to burn my computational budget and, more frustratingly, potentially wait hours for the batch job to start running. I don’t know what the throughput between the HPC system and my personal computer will look like (or whether I will run into security issues), but I’m curious to see. In the end, it might not work out, and I will return to grappling with analysis on the HPC system itself :slight_smile:

You could use Jupyter-forward to handle the port forwarding and tunneling into compute nodes, so that you get a Jupyter environment on your local machine! I’ve found this helpful when I don’t want to deal with the main JupyterHub directly and want to install custom environments.


Thanks so much for the help and the informative thread. "In general, if you are working with HPC, we strongly recommend using the HPC itself for your Pangeo workloads."

Is this in reference to analysis, or does it also apply to data transfer? For instance, let’s say I want to use the HPC system to analyze a Zarr store kept in a public Google Cloud bucket. (This is a different use case from the original question in the thread.)

Would the general recommendation be to transfer the Zarr data from Google Cloud to the HPC system and then analyze it there (because the HPC system’s connectivity to Google Cloud Storage is likely worse than Google Cloud compute’s connectivity to Google Cloud Storage)?
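
For concreteness, I mean something like this, run from an HPC node (the bucket path is just a placeholder):

    # Lazily open a Zarr store on a public GCS bucket with anonymous access
    import fsspec
    import xarray as xr

    mapper = fsspec.get_mapper("gs://some-public-bucket/dataset.zarr", token="anon")
    ds = xr.open_zarr(mapper, consolidated=True)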