Moving `larger` data into a Dask session

Hi everyone! I'd like to hear about your experience with this.

I have recently been digging into the Pangeo documentation and got really interested in using STAC endpoints to feed Dask arrays.
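To make it concrete, what I have in mind is roughly the pattern below (just a sketch; the Earth Search endpoint, the Sentinel-2 collection, and the asset names are example choices, and pystac-client/stackstac are one common combination rather than the only option):

```python
import pystac_client
import stackstac

# Query a public STAC API (example endpoint and collection)
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-74.1, 40.6, -73.7, 40.9],
    datetime="2023-06",
    max_items=20,
)
items = search.item_collection()

# stackstac builds a lazy, chunked (Dask-backed) DataArray; nothing is
# downloaded until the result is computed or written out.
da = stackstac.stack(items, assets=["red", "nir"], resolution=20)
print(da)  # xarray.DataArray backed by a dask.array
```

The catch is that the actual reads happen on the Dask workers, which is exactly where the HPC internet restriction bites.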

I wonder how this works on an HPC cluster where the worker nodes do not have internet access. In that case, would the STAC endpoint have to be queried elsewhere, say in a cloud environment, for retrieval purposes, and the data then moved onto the cluster filesystem before being loaded into Dask?
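In other words, the cluster-side step would look something like this (a sketch only; the /scratch path is hypothetical, and rioxarray/xarray are my assumed tools for reading the staged GeoTIFFs):

```python
import glob

import rioxarray  # provides open_rasterio and the .rio accessor
import xarray as xr

# Files already staged onto the cluster's shared filesystem (hypothetical path)
paths = sorted(glob.glob("/scratch/myproject/sentinel2/*.tif"))

# chunks=... keeps each file lazy, so Dask workers read blocks on demand
arrays = [rioxarray.open_rasterio(p, chunks={"x": 2048, "y": 2048}) for p in paths]
da = xr.concat(arrays, dim="time")

print(da.data)  # dask.array, only computed when needed
```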

I ask because even if the cluster had internet access, it would take a long time to move the data in (unless a Globus endpoint is involved, and even then there is some cost on ingest).

Hi @rlourenco,

So you seem to be talking about existing STAC endpoints accessible from the Internet, not local ones inside your HPC facility.

I’m not sure I’m reading your question correctly.

If you need data from the Internet but your compute nodes cannot reach it, you’ll need to fetch it through another server on your local network that does have access, and download it from there. That is true whether the data is exposed through STAC endpoints or anything else.
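For instance (just a sketch; the endpoint, collection, asset keys, and output directory are placeholders), from a login or data-transfer node that does have outbound access you could resolve the asset URLs with pystac-client and pull the files onto the shared filesystem, which the compute nodes can then read:

```python
import pathlib
import urllib.request

import pystac_client

# Run this on a node with Internet access (login / data-mover node)
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-74.1, 40.6, -73.7, 40.9],
    datetime="2023-06",
    max_items=5,
).item_collection()

outdir = pathlib.Path("/scratch/myproject/sentinel2")  # hypothetical shared path
outdir.mkdir(parents=True, exist_ok=True)

for item in items:
    for key in ("red", "nir"):  # example asset keys
        href = item.assets[key].href
        target = outdir / f"{item.id}_{key}.tif"
        if not target.exists():
            urllib.request.urlretrieve(href, str(target))  # plain HTTP download
```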

It’s also true that sometimes it is better to sync all the data you need to where your computing power is, especially if you access it frequently. But that also depends on the speed of the HPC’s outside network, the performance of the STAC endpoints, and the volume of data you want to read.
