We’re forming a new working group to explore alternative parallel computing frameworks to Dask, such as Apache Beam, Cubed, and Ramba.
I propose we try a bi-weekly, hour-long meeting, starting next week. If you’re interested, please fill out this Doodle poll with the times that work for you.
shoyer (September 14, 2022, 10:43pm):
I don’t know if I’ll be able to make an hour every other week, but I’ll try. 30 minutes would be more doable.
Maybe an hour was too much - let’s start by meeting at the best time from the poll and see what length we think is sufficient.
The time selected is every other Monday at 1pm ET. The first meeting will be Monday, Sept. 19.
The working group has been added to the calendar and published on the Pangeo website: Meeting Schedule and Notes — Pangeo documentation
The pangeo-data/distributed-array-examples repository on GitHub is where we’re collecting examples of challenging distributed array computations.
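To give a flavor of the genre, here is a hedged illustration written for this post (not taken from the repository): an anomaly-style computation where a global reduction gets broadcast back against every chunk.

```python
# Hypothetical example in the spirit of the collected workloads: an
# anomaly computation, where a full reduction is broadcast back against
# every chunk -- a communication pattern that stresses distributed
# schedulers.
import dask.array as da

# a year of daily fields on a small grid, chunked along time
data = da.random.random((365, 180, 360), chunks=(30, 180, 360))

# subtracting the time mean couples every chunk to a global reduction
anomaly = data - data.mean(axis=0)
print(anomaly.std(axis=0).mean().compute())
```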
Ah sorry to have missed it! (the one time I don’t check notifications on a weekend!)
I added some brief comments in red to the notes.
Good to keep us on our toes.
As a heads-up, the Dask folks at Coiled are collecting workloads for large-scale benchmarking to help inform development. We’ve gotten significantly more data-driven in the last few months. I think James Bourbeau is aware of the array examples repo Tom points to above. If anything else arises that would help guide things in Dask-land, please speak up.
Thanks all
Ugh… I was expecting a notification of the selected time the same way I got the initial email, so sorry I missed the first one. I’ll be there for the next one. Thanks for posting the notes.
Reminder that we have another meeting at 1pm EST TODAY (so in 45 minutes’ time)
Today we had a great intro to Ramba from Todd and Babu, and an overview of Cubed from Tom White - thanks everyone!
On the 31st of October we’re going to have a presentation on Arkouda from Scott Bachman!
Here is the link I promised in the meeting about standardizing a partitioning API.
From the GitHub issue (opened 06 Oct 2021):
In order for an ndarray/dataframe system to interact with a variety of frameworks in a distributed environment (such as clusters of workstations), a stable description of the distribution characteristics is needed.
The [__partitioned__ protocol](https://github.com/IntelPython/DPPY-Spec/blob/draft/partitioned/Partitioned.md) accomplishes this by defining a structure which provides the necessary meta-information. Implementations show that it allows data exchange between different distributed frameworks without unnecessary data transmission or even copies.
The structure defines how the data is partitioned, where it is located and provides a function to access the local data. It does not define or provide any means of communication, messaging and/or resource management. It merely describes the current distribution state.
In a way, this is similar to the meta-data which `dlpack` provides for exchanging data within a single node (including the local GPU) - but it does so for data which lives on more than one process/node. It complements mechanisms for intra-node exchange, such as `dlpack`, `__array_interface__`, and the like.
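For a concrete picture, here is a minimal sketch of what a `__partitioned__` property might return for a toy container. The key names (`shape`, `partition_tiling`, `partitions`, `get`) follow my reading of the draft spec linked above; treat them as illustrative rather than normative.

```python
# Illustrative sketch only: key names should be checked against the
# draft __partitioned__ spec before relying on them.
import numpy as np

class PartitionedArray:
    """A toy 1-D container split into two local numpy blocks."""

    def __init__(self):
        self._blocks = [np.arange(0, 5), np.arange(5, 10)]

    @property
    def __partitioned__(self):
        return {
            "shape": (10,),            # global shape
            "partition_tiling": (2,),  # two partitions along axis 0
            "partitions": {
                # grid position -> where the piece sits in the global
                # array, a handle to its data, and where that data lives
                (0,): {"start": (0,), "shape": (5,),
                       "data": self._blocks[0], "location": ["rank0"]},
                (1,): {"start": (5,), "shape": (5,),
                       "data": self._blocks[1], "location": ["rank1"]},
            },
            # resolve a data handle (here already local) to concrete data
            "get": lambda handle: handle,
        }
```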
The current lack of such a structure typically leads to one of the following scenarios when connecting different frameworks:
* a data consumer implements dedicated import functionality for every distributed data container it deems important enough. As an example, xgboost_ray implements a variety of data_sources.
This [PR](https://github.com/ray-project/xgboost_ray/pull/153) lets xgboost_ray automatically work with any new container that supports `__partitioned__` (the extra code for modin DF and Ray/MLDataSet is no longer needed once they support it, too)
* the user needs to explicitly deal with the distributed nature of the data. This either leads to unnecessary data transfer, and/or developers need to understand internals of the data container. In the latter case they are exposed to explicit parallelism/distribution, while often the original intent of the producer was to hide exactly that.
The implementation [here](https://github.com/fschlimb/daal4py/tree/feature/partitioned_interface) avoids that by entirely hiding the distribution features but still allowing zero-copy data exchange between `__partitioned__` (exemplified by modin ([PR](https://github.com/modin-project/modin/pull/3452)) and HeAT ([PR](https://github.com/helmholtz-analytics/heat/pull/788))) and scikit-learn-intelex/daal4py.
* frameworks like dask and ray wrap established APIs (such as pytorch, xgboost etc.) and ask developers to switch to the new framework/API and to adopt their programming model.
A structure like `__partitioned__` enables distributed data exchange between various frameworks which can be seamless to users while avoiding unnecessary data transfer.
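To make the consumer side concrete, here is a hedged sketch of generic ingestion: reassembling a global numpy array from any object exposing `__partitioned__`, assuming the dict layout sketched earlier (reusing the toy `PartitionedArray` from above) and purely local partitions.

```python
# Hedged consumer-side sketch: a generic gather over any object exposing
# __partitioned__, assuming all partitions are locally accessible.
import numpy as np

def gather(obj):
    meta = obj.__partitioned__
    out = np.empty(meta["shape"])
    for part in meta["partitions"].values():
        block = meta["get"](part["data"])  # resolve handle to local data
        region = tuple(slice(s, s + n)
                       for s, n in zip(part["start"], part["shape"]))
        out[region] = block
    return out

print(gather(PartitionedArray()))  # reassembles the toy array 0..9
```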
The [__partitioned__ protocol](https://github.com/IntelPython/DPPY-Spec/blob/draft/partitioned/Partitioned.md) is an initial proposal; any feedback from this consortium will be highly appreciated. We would like to build a neutral, open governance model that continues to oversee the evolution of the spec. For example, if some version of it receives a positive response from the community we would be more than happy to donate the __partitioned__ protocol to the data-api consortium or host it in a clean GitHub org.
Reminder that this will happen again today at 1pm EST!
We’re on again today at 1pm EST, when we will have a talk by Frank Schlimbach on the partitioning API!
Today (at 1pm EST) we have Deepak talking about his Dask algorithms for groupby problems using his package flox!
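As context for the talk, here is a minimal sketch of the kind of grouped reduction flox targets. I believe `flox.core.groupby_reduce` is the low-level entry point per the flox docs, but check the current API before copying this.

```python
# Minimal sketch of the grouped reduction flox accelerates; verify the
# entry point and signature against the current flox documentation.
import numpy as np
from flox.core import groupby_reduce

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([1, 2, 1, 2, 1, 2])  # e.g. month-of-year group labels

# on dask inputs flox picks tree-reduction strategies rather than a
# shuffle, avoiding the task explosion a naive groupby-map incurs
result, groups = groupby_reduce(values, labels, func="mean")
print(groups, result)  # [1 2] [3. 4.]
```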
We don’t have a speaker today, so unless someone has something they would like to discuss with the group or present last-minute, I think we should cancel today and pick it back up after Christmas! That would make the next meeting the 9th of January.
Sounds good to me.
Perhaps a good topic for next time would be “What’s needed for a full end-to-end workflow with cubed/Ramba” organized around your xarray+cubed PR?
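As background for that discussion, here is a short sketch of what the Cubed side of such a workflow looks like, adapted from the Cubed README at the time. Treat the names as assumptions to verify (e.g. the memory-budget parameter, which I believe was `max_mem` in older releases and `allowed_mem` later).

```python
# Sketch adapted from the Cubed README; every operation is planned to
# respect the memory budget, with intermediates persisted to work_dir.
import cubed
import cubed.array_api as xp

spec = cubed.Spec(work_dir="tmp", allowed_mem="100kB")

a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2), spec=spec)
b = xp.asarray([[1, 1, 1], [1, 1, 1], [1, 1, 1]], chunks=(2, 2), spec=spec)
c = xp.add(a, b)
print(c.compute())
```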
What are the plans for this working group in the new year? @everyone
Hey @Todd_Anderson ! Thanks for the reminder. I got completely distracted by conferences / workshops the last few weeks.
“What’s needed for a full end-to-end workflow with cubed/Ramba” organized around your xarray+cubed PR?
I could talk about this if people are interested? It’s not finished, but it would be useful information and should start some useful discussion on API standards etc.
IMO this should be our goal (and a great SciPy talk)