We’re forming a new working group to explore alternative parallel computing frameworks to Dask, such as Apache Beam, Cubed, and Ramba.
I propose we try a bi-weekly, hour-long meeting, starting next week. If you’re interested, please fill out this Doodle poll with the times that work for you.
shoyer (September 14, 2022, 10:43pm):
I don’t know if I’ll be able to make an hour every other week, but I’ll try. 30 minutes would be more doable.
Maybe an hour was too much - let’s start by meeting at the best time from the poll and see what length we think is sufficient.
The time selected is every other Monday at 1pm ET. The first meeting will be Monday, Sept. 19.
The working group has been added to the calendar and published on the Pangeo website: Meeting Schedule and Notes — Pangeo documentation
The pangeo-data/distributed-array-examples repository on GitHub is where we’re collecting examples of challenging distributed array computations.
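To give a flavor of the genre, here is a hedged illustration written for this post (not taken from the repository): an anomaly-style computation where a global reduction gets broadcast back against every chunk.

```python
# Hypothetical example in the spirit of the collected workloads: an
# anomaly computation, where a full reduction is broadcast back against
# every chunk -- a communication pattern that stresses distributed
# schedulers.
import dask.array as da

# a year of daily fields on a small grid, chunked along time
data = da.random.random((365, 180, 360), chunks=(30, 180, 360))

# subtracting the time mean couples every chunk to a global reduction
anomaly = data - data.mean(axis=0)
print(anomaly.std(axis=0).mean().compute())
```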
Ah sorry to have missed it! (the one time I don’t check notifications on a weekend!)
I added some brief comments in red to the notes.
Good to keep us on our toes.
As a heads-up, the Dask folks at Coiled are collecting workloads for large-scale benchmarking to help inform development. We’ve gotten significantly more data-driven in the last few months. I think James Bourbeau is aware of the array examples repo Tom points to above. If anything else arises that would help guide things in Dask-land, please speak up.
Thanks all
Ugh… I was expecting a notification of the selected time the same way I got the initial email, so sorry I missed the first one. I’ll be there for the next one. Thanks for posting the notes.
Reminder that we have another meeting at 1pm EST TODAY (so in 45 minutes’ time)
Today we had a great intro to Ramba from Todd and Babu, and an overview of Cubed from Tom White - thanks everyone!
On the 31st of October we’re going to have a presentation on Arkouda from Scott Bachman!
Here is the link I promised in the meeting about standardizing a partitioning API.
From the GitHub issue (opened 06 Oct 2021):
In order for an ndarray/dataframe system to interact with a variety of frameworks in a distributed environment (such as clusters of workstations), a stable description of the distribution characteristics is needed.
The [__partitioned__ protocol](https://github.com/IntelPython/DPPY-Spec/blob/draft/partitioned/Partitioned.md) accomplishes this by defining a structure which provides the necessary meta-information. Implementations show that it allows data exchange between different distributed frameworks without unnecessary data transmission or even copies.
The structure defines how the data is partitioned, where it is located and provides a function to access the local data. It does not define or provide any means of communication, messaging and/or resource management. It merely describes the current distribution state.
In a way, this is similar to the meta-data which `dlpack` provides for exchanging data within a single node (including the local GPU) - but it does so for data which lives on more than one process/node. It complements mechanisms for intra-node exchange, such as `dlpack`, `__array_interface__`, and the like.
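For a concrete picture, here is a minimal sketch of what a `__partitioned__` property might return for a toy container. The key names (`shape`, `partition_tiling`, `partitions`, `get`) follow my reading of the draft spec linked above; treat them as illustrative rather than normative.

```python
# Illustrative sketch only: key names should be checked against the
# draft __partitioned__ spec before relying on them.
import numpy as np

class PartitionedArray:
    """A toy 1-D container split into two local numpy blocks."""

    def __init__(self):
        self._blocks = [np.arange(0, 5), np.arange(5, 10)]

    @property
    def __partitioned__(self):
        return {
            "shape": (10,),            # global shape
            "partition_tiling": (2,),  # two partitions along axis 0
            "partitions": {
                # grid position -> where the piece sits in the global
                # array, a handle to its data, and where that data lives
                (0,): {"start": (0,), "shape": (5,),
                       "data": self._blocks[0], "location": ["rank0"]},
                (1,): {"start": (5,), "shape": (5,),
                       "data": self._blocks[1], "location": ["rank1"]},
            },
            # resolve a data handle (here already local) to concrete data
            "get": lambda handle: handle,
        }
```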
The current lack of such a structure typically leads to one of the following scenarios when connecting different frameworks:
* a data consumer implements dedicated import functionality for every distributed data container it deems important enough. As an example, xgboost_ray implements a variety of data_sources.
This [PR](https://github.com/ray-project/xgboost_ray/pull/153) lets xgboost_ray automatically work with any new container that supports `__partitioned__` (the extra code for modin DF and Ray/MLDataSet is no longer needed once they support it, too)
* the user needs to explicitly deal with the distributed nature of the data. This either leads to unnecessary data transfer, and/or developers need to understand internals of the data container. In the latter case they are exposed to explicit parallelism/distribution, while often the original intent of the producer was to hide exactly that.
The implementation [here](https://github.com/fschlimb/daal4py/tree/feature/partitioned_interface) avoids that by entirely hiding the distribution features but still allowing zero-copy data exchange between `__partitioned__` (exemplified by modin ([PR](https://github.com/modin-project/modin/pull/3452)) and HeAT ([PR](https://github.com/helmholtz-analytics/heat/pull/788))) and scikit-learn-intelex/daal4py.
* frameworks like dask and ray wrap established APIs (such as pytorch, xgboost etc.) and ask developers to switch to the new framework/API and to adopt their programming model.
A structure like `__partitioned__` enables distributed data exchange between various frameworks which can be seamless to users while avoiding unnecessary data transfer.
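To make the consumer side concrete, here is a hedged sketch of generic ingestion: reassembling a global numpy array from any object exposing `__partitioned__`, assuming the dict layout sketched earlier (reusing the toy `PartitionedArray` from above) and purely local partitions.

```python
# Hedged consumer-side sketch: a generic gather over any object exposing
# __partitioned__, assuming all partitions are locally accessible.
import numpy as np

def gather(obj):
    meta = obj.__partitioned__
    out = np.empty(meta["shape"])
    for part in meta["partitions"].values():
        block = meta["get"](part["data"])  # resolve handle to local data
        region = tuple(slice(s, s + n)
                       for s, n in zip(part["start"], part["shape"]))
        out[region] = block
    return out

print(gather(PartitionedArray()))  # reassembles the toy array 0..9
```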
The [__partitioned__ protocol](https://github.com/IntelPython/DPPY-Spec/blob/draft/partitioned/Partitioned.md) is an initial proposal; any feedback from this consortium will be highly appreciated. We would like to build a neutral, open governance model that continues to oversee the evolution of the spec. For example, if some version of it receives a positive response from the community we would be more than happy to donate the __partitioned__ protocol to the data-api consortium or host it in a clean GitHub org.
Reminder that this will happen again today at 1pm EST!
We’re on again today at 1pm EST, when we will have a talk by Frank Schlimbach on the partitioning API!
Today (at 1pm EST) we have Deepak talking about his Dask algorithms for groupby problems using his package flox!
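As context for the talk, here is a minimal sketch of the kind of grouped reduction flox targets. I believe `flox.core.groupby_reduce` is the low-level entry point per the flox docs, but check the current API before copying this.

```python
# Minimal sketch of the grouped reduction flox accelerates; verify the
# entry point and signature against the current flox documentation.
import numpy as np
from flox.core import groupby_reduce

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([1, 2, 1, 2, 1, 2])  # e.g. month-of-year group labels

# on dask inputs flox picks tree-reduction strategies rather than a
# shuffle, avoiding the task explosion a naive groupby-map incurs
result, groups = groupby_reduce(values, labels, func="mean")
print(groups, result)  # [1 2] [3. 4.]
```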
We don’t have a speaker today, so unless someone has something they would like to discuss with the group or present last-minute, I think we should cancel today and pick it back up after Christmas! That would make the next meeting the 9th of January.
Sounds good to me.
Perhaps a good topic for next time would be “What’s needed for a full end-to-end workflow with cubed/Ramba” organized around your xarray+cubed PR?
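As background for that discussion, here is a short sketch of what the Cubed side of such a workflow looks like, adapted from the Cubed README at the time. Treat the names as assumptions to verify (e.g. the memory-budget parameter, which I believe was `max_mem` in older releases and `allowed_mem` later).

```python
# Sketch adapted from the Cubed README; every operation is planned to
# respect the memory budget, with intermediates persisted to work_dir.
import cubed
import cubed.array_api as xp

spec = cubed.Spec(work_dir="tmp", allowed_mem="100kB")

a = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], chunks=(2, 2), spec=spec)
b = xp.asarray([[1, 1, 1], [1, 1, 1], [1, 1, 1]], chunks=(2, 2), spec=spec)
c = xp.add(a, b)
print(c.compute())
```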
What are the plans for this working group in the new year? @everyone
Hey @Todd_Anderson ! Thanks for the reminder. I got completely distracted by conferences / workshops the last few weeks.
“What’s needed for a full end-to-end workflow with cubed/Ramba” organized around your xarray+cubed PR?
I could talk about this if people are interested? It’s not finished, but it would be useful information and should start some useful discussion on API standards etc.
IMO this should be our goal (and a great SciPy talk)