Moving "larger" data into a Dask session

rlourenco · February 18, 2022, 5:33pm

Hi everyone! I would like to draw on your experience.

I have recently been digging into the Pangeo documentation and got really interested in using STAC endpoints to feed Dask arrays.

I wonder whether this becomes an issue on an HPC cluster where the worker nodes have no internet access. In that case, would the STAC endpoint have to be queried from, let's say, a cloud environment to retrieve the data, which would then be moved onto the cluster filesystem and loaded into Dask from there?

I ask because, even if the cluster had internet access, moving the data in would take a long time (unless a Globus endpoint is available, and even then the input transfer has a cost).

geynard · February 26, 2022, 1:43pm

Hi @rlourenco,

So you seem to be talking about existing STAC endpoints accessible from the Internet, not ones hosted locally in your HPC facility.

I’m not sure I’m getting your question correctly.

If you need data from the Internet but your compute nodes cannot access it, you'll need another server within your local network that can reach the data, and download it through that server. This holds whether the data is exposed through STAC endpoints or some other way.
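As a rough sketch of that staging step, assuming you already have a list of asset URLs (for example from a STAC search run on a login or transfer node that has internet access) and a directory on the shared cluster filesystem: the function names `local_path_for` and `stage_assets` are hypothetical helpers, not part of any STAC or Dask library.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(href: str, root: Path) -> Path:
    """Map a remote asset URL to a path under the shared filesystem root."""
    parsed = urlparse(href)
    # Keep host and path so assets from different catalogs do not collide.
    return root / parsed.netloc / parsed.path.lstrip("/")


def stage_assets(hrefs, root: Path, fetch=urllib.request.urlretrieve):
    """Download each asset once and return the local paths in input order.

    Assumption: this runs on the node that has internet access; `fetch`
    can be swapped for another transfer tool with the same (url, filename)
    call shape.
    """
    local_paths = []
    for href in hrefs:
        target = local_path_for(href, root)
        if not target.exists():  # skip files staged by an earlier run
            target.parent.mkdir(parents=True, exist_ok=True)
            fetch(href, str(target))
        local_paths.append(target)
    return local_paths
```

On the compute nodes, the returned paths could then be opened lazily from the cluster filesystem, for instance with `xarray.open_mfdataset(paths, chunks={})` if the files are NetCDF, so the Dask workers never need to reach the Internet.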

It's also true that it is sometimes better to sync all the data you need to the place where your computing power is, especially if you access it frequently. But it also depends on the speed of the HPC's external network, the performance of the STAC endpoints, and the volume of data you want to read.
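To get a feel for that trade-off, a back-of-the-envelope estimate can help. This is only a sketch: `transfer_hours` is a hypothetical helper, the numbers in the example call are made up, and real throughput depends on the endpoint, the protocol, and how busy the link is.

```python
def transfer_hours(volume_gb: float, bandwidth_mbit_s: float,
                   efficiency: float = 1.0) -> float:
    """Estimate the hours needed to move `volume_gb` gigabytes over a link.

    `efficiency` (0 to 1) is an assumed fraction of the nominal bandwidth
    that is actually achieved.
    """
    volume_mbit = volume_gb * 1000 * 8  # GB -> MB -> megabits (decimal units)
    seconds = volume_mbit / (bandwidth_mbit_s * efficiency)
    return seconds / 3600


# Hypothetical case: 1 TB over a 1 Gbit/s link at full nominal speed.
print(f"{transfer_hours(1000, 1000):.1f} h")
```

If the same data is read many times, paying this cost once for a local copy may be cheaper than repeating remote reads on every run.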
