Thanks for the response Tom. Some comments:
open_mfdataset
I don’t think it’s enough to point people to Zarr. I’m pretty committed to also making reading tons of NetCDF files in the cloud fast. My sense is that this is actually pretty doable today if we parallelize the metadata reads with open_mfdataset(..., parallel=True); the problem is that merging that metadata today performs poorly. @jrbourbeau has been working on this a bit recently.
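The pattern here — open each file's metadata concurrently, then merge the results in one step — can be sketched with stdlib tools. This is only an illustration of the shape of the problem: xarray's open_mfdataset(parallel=True) does the real version via dask.delayed, and the open_one / merge_meta functions below are hypothetical stand-ins, not xarray APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: a real open_one would call xr.open_dataset on a
# path and return its metadata; a real merge_meta would align coordinates.
def open_one(path):
    return {"path": path, "vars": ["t"]}          # pretend per-file metadata

def merge_meta(metas):
    return {m["path"]: m["vars"] for m in metas}  # pretend merge step

paths = [f"part{i}.nc" for i in range(4)]

# The per-file opens parallelize well (this is what parallel=True buys you);
# the single merge at the end is where the current slowness lives.
with ThreadPoolExecutor() as pool:
    metas = list(pool.map(open_one, paths))
combined = merge_meta(metas)
```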
Query Optimization
A lot of dask-expr has nothing to do with Dask. It has to do with writing down a symbolic expression system that matches the semantics of a common library (like pandas or numpy). We then add Dask protocols to these objects so that they can operate well in a Dask context. That work is pretty orthogonal though. I think that you could probably achieve what you wanted in dask-expr if you cared to.
Watch out though! There are hard algorithmic problems here. For example we’ll need to figure out how to pass slicing operations through broadcasting operations. It’s not for the faint of heart.
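To make the slicing-through-broadcasting problem concrete, here is a small NumPy sketch. An optimizer would like to rewrite (a + b)[rows] as a[rows] + b[rows] so the slice happens before the expensive operation, but that rewrite is wrong on any axis where an operand was broadcast: the slice must be dropped (or clamped) on size-1 axes. The shapes below are illustrative, not from any real workload.

```python
import numpy as np

a = np.arange(4).reshape(1, 4)    # size-1 leading axis: broadcasts over rows
b = np.arange(12).reshape(3, 4)

full = (a + b)[1:3]               # slice applied after the broadcasted add

# Naive pushdown a[1:3] + b[1:3] is wrong: a[1:3] has shape (0, 4) and the
# addition would fail. A correct rewrite keeps the full extent on broadcast
# (size-1) axes and only pushes the slice into the non-broadcast operand:
pushed = a[0:1] + b[1:3]

assert np.array_equal(full, pushed)
```

Doing this in general means tracking, per operand and per axis, whether the axis was broadcast, which is part of what makes the expression-rewriting work hard.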
Bounded Memory Rechunking
As I mentioned above, I think it’s probably time for Dask core devs to talk with Xarray folks about where we are today. I’ve already raised a Coiled-internal issue about a webinar on this topic.
This team is at an offsite this week (integrating dask-expr into dask.dataframe) and I’m guessing that they’ll be focused on dask.dataframe into early Q1, but later in Q1 could be a good time for this I think.
Benchmarking
Openness
Just to be clear, even though coiled-benchmarks has “coiled” in the name it’s pretty open. Benchmarks are triggered by GitHub Actions, the source is open, people outside of Coiled have write permissions, it benchmarks projects that aren’t Dask (like DuckDB, Polars, Spark), etc. There’s nothing to stop it from growing.
If you want to benchmark something like Cubed on AWS Lambda, someone will have to pay for that though. We’d put those credentials somewhere in a GitHub Actions secret.
Determining benchmarks
Cool, everyone is saying “I’d love to see a shared set of benchmarks”. How can we make this happen? Do you personally have time to do some work here? Maybe @dcherian from Earthmover? Maybe someone else?
Coiled folks are happy to do some work here (we already have some benchmarks pulled from various github repos) but we’re not experts here. This really has to come from you all. If you have time to devote towards this that would be very welcome (although of course you have no obligation to do so).
Merge Groups
Sure? What is the concrete step that’s being proposed here? Are you asking someone in particular to attend a regular meeting? (if so, my personal answer is “no, but I’m happy to talk ad-hoc any time when things come up”)
I care about distributed arrays, and I’m happy to devote the resources of those around me to this effort. However, I also need some things in return (like people to make benchmarks, and maybe someone to help flesh out the symbolic part of dask_expr.array). I’m very open to any mutually beneficial collaboration.
My personal experience is that working groups tend to be more about talking and less about working. With that in mind I’m more motivated to collaborate on specific work products.