Arkouda as an alternative for working with massive arrays?

I recently stumbled across a nice article explaining why climate models are written in Fortran. At the end of the article, the author mentions a new library called Arkouda, which is pitched as NumPy-like arrays at massive scale backed by the programming language Chapel: “Arkouda is not trying to replace Pandas but to allow for some Pandas-style operation at a much larger scale.”

This kind of stuff is way outside my area of expertise, but Arkouda could be a project for the Pangeo community to keep an eye on?

1 Like

Chapel in general has been a great project with lots of cool ideas… it acknowledges that it draws a lot of inspiration from Python.

Seems like there’s been a ton of effort to support Linux environments; for the longest time only a small subset of the language really worked off of Cray machines. But it might be really instructive to compare an Arkouda-backed analysis with a comparable Dask one. Possibly a bit apples-to-oranges, but it would be a great conversation starter for engagement between the communities!

I did look into using Arkouda as a backend for Xarray for a bit. The biggest hurdle is that support for n-dimensional arrays is only just coming along. It’s still young, but Arkouda will see more development in the near future, so it could be a good time to look into using it for some things at large scales.

I believe the Arkouda folks did some Arkouda vs. Dask comparisons with very large (10+ TB) datasets and found that Arkouda performed much better at very large scales. Dask, however, is much more fully featured at this point in time. If people here would like to see those results, I can reach out and ask them to post something here.

Chapel is pretty great and relatively underrated. It runs well on Cray and non-Cray environments. I run it on my Mac, Linux box, and small cluster (headless Ubuntu nodes) all the time. The dev team is incredibly kind and will try to help with whatever you need. They recently fielded some requests, from people who might be part of this community, about NetCDF and Fortran interop support.

Full disclaimer: I’ve contributed to Arkouda and Chapel, and I wrote the article mentioned here. Thanks for reading!

1 Like

Speaking for the Chapel community (and as an Arkouda contributor), we’ve been interested in doing an Arkouda vs. Dask performance / scalability comparison for quite some time, but have been reluctant to publish the results we’ve obtained to date because (a) we’re admittedly not Dask experts and (b) we’re obviously not unbiased. But if someone else would be interested in running such a comparison, we’d be happy to provide support and tips on the Chapel/Arkouda side of things. (Of course, we’re also happy to provide support and tips on Chapel and Arkouda to anyone who’s interested in them for any purpose).

And then one slight correction:

I’m not sure what you’re thinking of here, but Chapel’s full feature set has been portable to all systems for all time (modulo bugs). Since the project’s inception, we’ve done most of our development and testing on non-Cray systems because cycles are more readily available. A more accurate characterization might be that we haven’t been as focused on Chapel performance on non-Cray systems for most of the project’s lifetime, but that has shifted recently as we’ve looked increasingly at improvements for InfiniBand-based systems.

-Brad

1 Like

Hey @bradcray, thanks for the input here, and I very much appreciate the correction! I played with Chapel on a Linux system I used for my core research back after my PhD… but come to think of it, that was nearly 4 years ago! I’m sure any problems I ran into at the time were just PEBKAC.

Super excited to see the Chapel and Arkouda projects continue to grow!

2 Likes

My understanding is that Dask and Arkouda are doing pretty different things, so a direct head-to-head shootout might not be the most productive way forward. Instead, we should seek to understand when to choose one over the other.

From reading the Arkouda docs, the biggest difference I see is this:

  • Arkouda is an in-memory distributed array. It uses the distributed memory of many HPC nodes to hold arrays bigger than a single node’s memory. Its computations are performed eagerly and synchronously. This is the way most HPC applications (e.g. climate models) work.
  • Dask Arrays are lazy (e.g. generated via delayed execution / callbacks), and Dask computations are represented as directed acyclic graphs. Computations are executed by passing a graph to a distributed scheduler, where it can be run asynchronously across a cluster of workers. This execution paradigm is less familiar to many HPC users. (A short sketch of this lazy pattern follows this list.)
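
To make the contrast concrete, here is a minimal Dask sketch of that lazy pattern (the array shape and chunking are made up purely for illustration):

import dask.array as da

# Build a lazy ~1 TB array; no data is allocated yet. Dask only records
# the operations in a task graph (a DAG).
x = da.random.random((500_000, 250_000), chunks=(10_000, 10_000))
y = (x - x.mean(axis=0)).std(axis=0)  # still lazy: a graph, not a result

# Work happens only when we explicitly ask for it; on a distributed
# cluster this streams through the array chunk by chunk.
result = y.compute()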

We often run into problems with Dask when we create graphs that are too big / complex to be executed by our scheduler and workers. With Arkouda, that would be impossible (AFAIK) because it holds everything in memory and operates eagerly. On the other hand, Arkouda will not be able to handle cases where the aggregate array memory exceeds the cluster memory, whereas Dask has no problem with this.

@bradcray - I agree with your concerns about biased benchmarks. I always point people to this great blog post by @mrocklin about that:

https://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks

Arkouda will not be able to handle cases where the aggregate array memory exceeds the cluster memory, whereas Dask has no problem with this.

@rabernat: I’m interested in confirming something that I’ve only heard second-hand: Is it true that Dask array indices (and therefore sizes) are limited to 32-bit integers?

And then for my curiosity: What size arrays do you work with (or want to work with) in practice? And is the straining point for your computations typically the overall size of the arrays, or having large numbers of arrays? (or both?)

I ask because I’ve yet to run into use cases where a cluster’s memory size has been a limiting factor for array size in Arkouda/Chapel (though, granted, we tend to have access to fairly large systems at Cray/HPE; but then again, so do most people nowadays, via the cloud). And if the 32-bit limit is real, it seems that a user would typically hit that limit long before they’d run out of cluster memory?

With Arkouda, that would be impossible (AFAIK)

With the current implementation, you’re right that things are not evaluated lazily. With effort, this could be added, but it hasn’t been a pain point for Arkouda’s user community that I’m aware of.

While I understand the principles of lazy evaluation, I’m not seeing what benefit it’s getting you here vs. eager evaluation. Can you clarify, maybe through an example, what the value-add is for your use cases?

1 Like

We typically work with arrays of order 10 TB. In Pangeo Cloud, we use spot pricing and elastic scaling to spin up clusters of ~50 nodes with ~10 GB memory each. So Dask allows us to process these large arrays in a cost-effective way with much less cluster memory than the total array size.

Throwing more memory at the problem of course makes things easier / faster! In our experience, Dask performance is usually great when the array size is much smaller than the cluster memory.

Lazy evaluation goes hand-in-hand with the memory constraint. If you can’t fit all your data in memory, you need lazy / delayed evaluation and smart task scheduling. The DAG-based representation of the computation also allows for optimizations. But I agree, if you can just fit all your data in memory, then this additional complexity is unnecessary and probably won’t improve performance compared to the eager case.
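
As a toy illustration of that last point (sizes made up): because Dask sees the whole graph before executing anything, slicing a lazy array before reducing means only the chunks touched by the slice are ever computed.

import dask.array as da

# A lazily defined ~80 GB array; nothing is allocated yet.
x = da.random.random((100_000, 100_000), chunks=(5_000, 5_000))

# Dask sees the slice and the reduction together, so only the single
# chunk touched by the slice gets materialized.
subset_mean = x[:5_000, :5_000].mean()
print(subset_mean.compute())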

There are lots of Pangeo examples on our gallery page: http://gallery.pangeo.io/. If you’re interested in learning about our use cases, I invite you to check them out for yourself! All examples are coupled with Binder so you can try them out live.

1 Like

To clarify, I wasn’t asking about end-user use-cases of Pangeo so much as a sample computational pattern in Pangeo that would concretely illustrate how Dask’s lazy approach benefits you, to help me get my head around the contrast you were describing above.

For example, a typical pattern in Arkouda might be something like: read a 10TB multi-column data set into memory from disk, sort the arrays in terms of one of the columns, and then capture all of the rows that share the maximal value for that column into a smaller array for use in Python. For a computation like this, it’s difficult for me to see how lazy evaluation would help anything since sorting is such a global operation by nature—whether I do it lazily or eagerly, I’m going to need to compute with all of the data to get the result and move on to the next step. I can imagine eager implementations that would be smart about sorting out-of-core data sets by bringing chunks of data in and out of the disks in a coordinated manner, so I don’t see laziness as being intrinsically required for data sets that exceed physical memory sizes.
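
In Arkouda’s client API, that pattern might look roughly like the following sketch (it assumes an arkouda_server is already running, and uses ak.randint with illustrative sizes to stand in for columns that would really be read from disk):

import arkouda as ak

ak.connect()                          # attach to the running arkouda_server

key = ak.randint(0, 2**32, 10**9)     # the column we sort by
val = ak.randint(0, 100, 10**9)       # a second column carried along

perm = ak.argsort(key)                # distributed sort permutation
key_sorted = key[perm]
val_sorted = val[perm]

# rows sharing the maximal key value, captured as a much smaller array
max_rows = val_sorted[key_sorted == key_sorted.max()]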

So that made me curious what a typical Pangeo computational pattern looks like, in order to understand how it differs from this case and illustrates the benefit you’re getting from lazy evaluation. E.g., what is it about your computations’ requirements that permit them to leverage laziness, such as needing to compute with only a subset of the data or delaying computation to the point when the data is needed, if ever (the two benefits I most strongly associate with lazy evaluation).

Thanks,
-Brad

PS – I’m still curious about the following question as well, if a Dask expert here could verify what I’ve been told:

Is it true that Dask array indices (and therefore sizes) are limited to 32-bit integers?

OK, I understand your question now. Indeed, the “full sort” operation you described is very hard to do lazily.

A typical Pangeo use case involves multi-dimensional arrays, like the CESM POP high-resolution control dataset from our catalog: https://catalog.pangeo.io/browse/master/ocean/CESM_POP/CESM_POP_hires_control/

(The row-column paradigm you described in your example is what we usually refer to as “tabular data”; that would be handled by dask.dataframe, which provides a pandas-compatible API, rather than dask.array.)

Usually we are reducing the data in some way. The simplest example computation is to just take the mean along certain dimensions, e.g. to produce a timeseries:

data.mean(dim=('lon', 'lat'))

More complicated examples would be:

  • take two such arrays and compute the pointwise covariance along the time axis (sketched after this list)
  • calculate the Fourier power spectrum along one or more dimensions and average over the others
  • calculate joint histograms between multiple arrays (with xhistogram) and reduce along one or more axes
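
For example, a self-contained sketch of the first two patterns with Dask-backed Xarray data (the variable names sst and slp, the shapes, and the chunking are all made up for illustration; xr.cov is xarray’s built-in covariance helper):

import dask.array as da
import xarray as xr

# Illustrative Dask-backed DataArrays (~380 GB each); real data would be
# opened lazily from Zarr or NetCDF, e.g. with xr.open_zarr.
shape, chunks = (7300, 1800, 3600), (100, 450, 450)
sst = xr.DataArray(da.random.random(shape, chunks=chunks),
                   dims=("time", "lat", "lon"), name="sst")
slp = xr.DataArray(da.random.random(shape, chunks=chunks),
                   dims=("time", "lat", "lon"), name="slp")

# 1. Area-mean timeseries (lazy until .compute())
ts = sst.mean(dim=("lon", "lat"))

# 2. Pointwise covariance of the two fields along the time axis
cov = xr.cov(sst, slp, dim="time")

ts, cov = ts.compute(), cov.compute()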

I hope that’s helpful. I appreciate your engagement on this forum; it’s always nice to have an outside perspective.

I can’t comment on this. My assumption is that Dask indexing is the same as NumPy’s. As noted above, we tend to work with multidimensional arrays, so we can create very large arrays without coming anywhere close to this limit on any single axis.

2 Likes

I just checked this:

import dask.array as dsa

# An array with 8 * (2**31 - 1) elements along a single axis, well past
# the 32-bit signed integer limit, is constructed without complaint.
max32_bit_int = 2_147_483_647
dsa.ones(shape=(8 * max32_bit_int,), chunks=(max32_bit_int // 256))

No problem for Dask.

2 Likes