Any interest in using Ray?

Hi,

I see that this community is heavily invested in Dask. Has anyone looked at Ray (ray.io) from Pangeo's perspective? Also, here is a blog post on the benefits of Ray over Dask for various tasks - Benchmarks: Dask Distributed vs. Ray for Dask Workloads.

Would love to hear the feedback.

Best,
Raghu


Thanks for the question and welcome to the forum!

Ray has certainly been on our radar. However, many (most?) Pangeo workflows involve nd-array data, which aligns well with the dask.array module. AFAIK Ray only works with tabular (i.e. dataframe-style) data.

I read your blog post. It was not exactly clear to me what you are doing. Your workflow seems to create dask arrays within dask tasks:

  • Load
    • Download file from S3 and load each file into 2D np.ndarray of type np.float32
    • Convert np.ndarray into dask.array of chunksize 0.25GB. Concatenate the dask.array into one dask.array.
    • Compute some metadata & wrap the dask.array + metadata in an xarray.Dataset
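If I understand those steps correctly, a minimal sketch of the "load, then re-wrap in dask" pattern looks like this (shapes and file count are made up for illustration), and it shows why the re-wrap is unnecessary:

```python
import numpy as np

# Stand-ins for files already downloaded from S3 and decoded:
# each one a 2D float32 array that is fully in memory.
fields = [np.ones((100, 200), dtype=np.float32) for _ in range(4)]

# Once the data is sitting in memory as np.ndarray, plain NumPy
# stacking/concatenation gives the combined array directly:
combined = np.stack(fields)  # shape (4, 100, 200)

# Re-wrapping with dask.array.from_array(combined, chunks=...) at this
# point only adds task-graph and scheduling overhead; there is no
# out-of-core or parallel work left for dask to do on in-memory data.
print(combined.shape, combined.dtype)
```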

If you have already loaded the data into memory (np.ndarray), there is no reason to go back to a dask array. That just adds graph-construction and scheduling overhead, which will slow down your computation.

Then you say this:

  • We end up with ~1.86 million 8MB zarr (xarray) files, totaling ~14.88 Terabytes
  • This means that we’ll be computing 1.86 million of the above graphs.

This really does not align with the way we use Dask in Pangeo. For a big computation, we typically just have one Zarr store and one task graph. This is an example of a typical workflow: http://gallery.pangeo.io/repos/pangeo-gallery/physical-oceanography/01_sea-surface-height.html
I’d love to see how to do the same thing with Ray.
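For concreteness, the "one store, one graph" pattern looks roughly like the sketch below. The variable name and dimensions are placeholders, not the actual gallery example, and the in-memory Dataset stands in for `xr.open_zarr(...)` on a single large cloud store:

```python
import numpy as np
import xarray as xr

# Stand-in for xr.open_zarr("s3://...") on one big Zarr store;
# "sla" (sea-level anomaly) is a placeholder variable name.
ds = xr.Dataset(
    {"sla": (("time", "lat", "lon"),
             np.ones((12, 4, 8), dtype=np.float32))}
).chunk({"time": 3})  # lazy, dask-backed arrays over one chunked store

# One expression builds ONE lazy task graph over the whole store...
climatology = ds.sla.mean(dim="time")

# ...and a single .compute() executes that graph (locally or on a cluster).
result = climatology.compute()
print(result.shape)  # (4, 8)
```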

So overall my impression is that you have shared a biased benchmark.


Thanks @rabernat for your prompt response. Actually, my goal is to get a clear understanding of where Dask and Ray intersect and when to use which. I am not a Dask user and am unfamiliar with the typical workflows, so I really appreciate you sharing the example workflow. Also, the blog post is something I came across; it does not reflect my own views, which I am still forming through conversations with key community members here.

Looking at the workflow you shared, the key feature you are leveraging from Dask is its xarray-backed arrays and the transformations on them (e.g., mean, transpose). Do you have an example where you use Dask in combination with an ML or AI component (e.g., scikit-learn, TensorFlow, PyTorch)?

Ray has certainly been on our radar. However, many (most?) Pangeo workflows involve nd-array data, which aligns great with the dask.Array module. AFAIK Ray only works with tabular (i.e. dataframe) style data.

What aspects of Ray are on your radar? Also, Ray itself does not support dataframes (that is Modin) and is primarily built around a distributed object store.