Any interest in using Ray?

Hi,

I see that this community is heavily invested in Dask. Has anyone looked at Ray (ray.io) from Pangeo's perspective? Also, here is a blog post on the benefits of Ray over Dask for various tasks - Benchmarks: Dask Distributed vs. Ray for Dask Workloads.

Would love to hear the feedback.

Best,
Raghu


Thanks for the question and welcome to the forum!

Ray has certainly been on our radar. However, many (most?) Pangeo workflows involve nd-array data, which aligns well with the dask.array module. AFAIK Ray only works with tabular (i.e. dataframe-style) data.

I read your blog post. It was not exactly clear to me what you are doing. Your workflow seems to create dask arrays within dask tasks:

  • Load
    • Download file from S3 and load each file into 2D np.ndarray of type np.float32
    • Convert np.ndarray into dask.array of chunksize 0.25GB. Concatenate the dask.array into one dask.array.
    • Compute some metadata & wrap the dask.array + metadata in an xarray.Dataset
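If I understand those steps correctly, a minimal sketch of the "load, then re-wrap in dask" pattern looks like this (shapes and file count are made up for illustration), and it shows why the re-wrap is unnecessary:

```python
import numpy as np

# Stand-ins for files already downloaded from S3 and decoded:
# each one a 2D float32 array that is fully in memory.
fields = [np.ones((100, 200), dtype=np.float32) for _ in range(4)]

# Once the data is sitting in memory as np.ndarray, plain NumPy
# stacking/concatenation gives the combined array directly:
combined = np.stack(fields)  # shape (4, 100, 200)

# Re-wrapping with dask.array.from_array(combined, chunks=...) at this
# point only adds task-graph and scheduling overhead; there is no
# out-of-core or parallel work left for dask to do on in-memory data.
print(combined.shape, combined.dtype)
```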

If you have already loaded the data into memory (np.ndarray), there is no reason to go back to a dask array. That just adds graph-construction and scheduling overhead, which will slow down your computation.

Then you say this:

  • We end up with ~1.86 million 8MB zarr (xarray) files, totaling ~14.88 Terabytes
  • This means that we’ll be computing 1.86 million of the above graphs.

This really does not align with the way we use Dask in Pangeo. For a big computation, we typically just have one Zarr store and one task graph. This is an example of a typical workflow: http://gallery.pangeo.io/repos/pangeo-gallery/physical-oceanography/01_sea-surface-height.html
I’d love to see how to do the same thing with Ray.
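For concreteness, the "one store, one graph" pattern looks roughly like the sketch below. The variable name and dimensions are placeholders, not the actual gallery example, and the in-memory Dataset stands in for `xr.open_zarr(...)` on a single large cloud store:

```python
import numpy as np
import xarray as xr

# Stand-in for xr.open_zarr("s3://...") on one big Zarr store;
# "sla" (sea-level anomaly) is a placeholder variable name.
ds = xr.Dataset(
    {"sla": (("time", "lat", "lon"),
             np.ones((12, 4, 8), dtype=np.float32))}
).chunk({"time": 3})  # lazy, dask-backed arrays over one chunked store

# One expression builds ONE lazy task graph over the whole store...
climatology = ds.sla.mean(dim="time")

# ...and a single .compute() executes that graph (locally or on a cluster).
result = climatology.compute()
print(result.shape)  # (4, 8)
```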

So overall my impression is that you have shared a biased benchmark.


Thanks @rabernat for your prompt response. Actually, my goal is to get a clear understanding of where Dask and Ray intersect and when to use which. I am not a Dask user and am unfamiliar with the typical workflows, so I really appreciate you sharing the example workflow. Also, the blog post is something I came across; it does not reflect my own views, which I am still forming through conversations with key community members here.

Looking at the workflow you shared, the key feature you are leveraging from Dask is its xarray-backed arrays and the transformations on them (e.g., mean, transpose). Do you have an example where you use Dask in combination with an ML or AI component (e.g., scikit-learn, TensorFlow, PyTorch)?

Ray has certainly been on our radar. However, many (most?) Pangeo workflows involve nd-array data, which aligns great with the dask.Array module. AFAIK Ray only works with tabular (i.e. dataframe) style data.

What aspects of Ray are on your radar? Also, Ray itself does not support dataframes (that is Modin) and is primarily built around a distributed object store.