I see that this community is heavily interested in Dask. Has anyone looked at Ray (ray.io) and its potential use from Pangeo's perspective? Also, here is a blog post on the benefits of Ray over Dask for various tasks: Benchmarks: Dask Distributed vs. Ray for Dask Workloads.
Ray has certainly been on our radar. However, many (most?) Pangeo workflows involve nd-array data, which aligns well with the dask.array module. AFAIK Ray only works with tabular (i.e. dataframe) style data.
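To illustrate what I mean by nd-array data, here is a minimal sketch of the dask.array style of computation (the array sizes and chunking are arbitrary):

```python
import dask.array as da

# A lazy, chunked multi-dimensional array; nothing is loaded or computed yet.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Array-style operations build a task graph over the chunks.
anomaly_std = (x - x.mean(axis=0)).std(axis=0)

# Execution happens on whatever scheduler is configured
# (threads, processes, or a distributed cluster).
result = anomaly_std.compute()
```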
I read your blog post. It was not exactly clear to me what you are doing. Your workflow seems to create dask arrays within dask tasks:
Load:
- Download each file from S3 and load it into a 2D np.ndarray of type np.float32
- Convert each np.ndarray into a dask.array with a chunk size of 0.25 GB, and concatenate the dask.arrays into one dask.array
- Compute some metadata & wrap the dask.array + metadata in an xarray.Dataset
If you have already loaded the data into memory (np.ndarray), there is absolutely no reason to go back to dask array. That’s just going to slow down your computation.
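For concreteness, here is a rough sketch of the pattern described above next to the simpler alternative (the file keys, array shapes, and loader function are made-up placeholders):

```python
import numpy as np
import dask.array as da
import xarray as xr

def load_file(key):
    # Placeholder for "download a file from S3 and load it into a 2D float32 array".
    return np.random.rand(5_000, 5_000).astype(np.float32)

arrays = [load_file(k) for k in ["s3://bucket/file-a", "s3://bucket/file-b"]]

# Pattern described in the blog post: wrap the already-in-memory arrays in
# dask (chunks of roughly 0.25 GB) and concatenate them.
dask_arrays = [da.from_array(a, chunks="256 MiB") for a in arrays]
combined = da.concatenate(dask_arrays, axis=0)
ds = xr.Dataset({"field": (("y", "x"), combined)})

# Since the data already live in memory as NumPy arrays, the simpler (and
# faster) alternative is to skip dask entirely:
ds_simple = xr.Dataset({"field": (("y", "x"), np.concatenate(arrays, axis=0))})
```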
Then you say this:
> We end up with ~1.86 million 8MB zarr (xarray) files, totaling ~14.88 Terabytes
This means that we’ll be computing 1.86 million of the above graphs.
Thanks @rabernat for your prompt response. My goal is to get a clear understanding of where Dask and Ray intersect and when to use which. I am not a Dask user and am unfamiliar with the typical workflows, so I really appreciate you sharing the example workflow here. Also, the blog post is something I came across, not my own views (which I am trying to form based on conversations with key community members here).
Looking at the workflow you shared, the key feature you are leveraging from Dask is dask-backed xarray objects and the transformations on them (e.g., mean, transpose). Do you have an example where you use Dask in combination with an ML or AI component (e.g., sklearn, tensorflow, pytorch)?
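For reference, a small hypothetical sketch of the kind of dask-backed xarray transformations mentioned above (the variable and dimension names are made up):

```python
import dask.array as da
import xarray as xr

# Hypothetical dataset: a year of daily 1-degree sea-surface temperature,
# backed by a lazily chunked dask array.
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"),
             da.random.random((365, 180, 360), chunks=(30, 180, 360)))}
)

# These transformations only build a dask graph; no data are computed yet.
climatology = ds["sst"].mean(dim="time")
reordered = ds["sst"].transpose("lon", "lat", "time")

# compute() hands the graph to the scheduler (local threads or dask.distributed).
climatology_values = climatology.compute()
```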
> Ray has certainly been on our radar. However, many (most?) Pangeo workflows involve nd-array data, which aligns well with the dask.array module. AFAIK Ray only works with tabular (i.e. dataframe) style data.
What aspects of Ray are on your radar? Also, Ray itself does not support dataframes (that is Modin); it is primarily built around a distributed object store.
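To make the object-store point concrete, here is a tiny sketch of Ray's task and object-store model (purely illustrative; the data and function are arbitrary):

```python
import numpy as np
import ray

ray.init()  # starts a local Ray instance if no cluster address is given

@ray.remote
def double(arr):
    # Runs as a Ray task; `arr` is fetched from the shared object store.
    return arr * 2

ref = ray.put(np.arange(1_000_000))   # place the array in the object store
result = ray.get(double.remote(ref))  # schedule the task and fetch its result
```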