Nice, Deepak! I tried again with map-reduce using my original 20-year data, which is in chunks of {'time': 24}. I opened it with larger chunks, {'time': 480}, which seemed to make a big difference.
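Roughly what I mean, as a sketch (the store path, variable name, and monthly grouping are placeholders, not my actual setup):

```python
import flox.xarray
import xarray as xr

# Hypothetical store path and variable name, just for illustration.
# The data is chunked {'time': 24} on disk; open it with bigger chunks.
ds = xr.open_zarr("s3://my-bucket/20yr-data.zarr", chunks={"time": 480})

# Map-reduce groupby via flox (grouping by month here is just an example).
monthly_mean = flox.xarray.xarray_reduce(
    ds["my_var"],
    ds["time"].dt.month,
    func="mean",
    method="map-reduce",
)
result = monthly_mean.compute()
```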
This ran in 2:30 on my 20-node dask cluster with 40 GB of memory per worker! That seems amazingly fast! As you can see, the dask cluster is pretty happy.
However, you can also see plenty of memory spikes reaching 16 GB or more, so I was very happy to have the headroom of 40 GB per worker. In my experience, once the workers come under memory pressure and start spilling to disk, it's over.
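On the spilling point, the thresholds at which dask workers start spilling, pausing, or getting restarted are configurable. A quick sketch of the knobs I mean (the fractions below are just the documented defaults, not a recommendation):

```python
import dask

# Dask distributed worker-memory thresholds, expressed as fractions of the
# per-worker memory limit (40 GB in my case).
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling managed data to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})
```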
This was my first time really doing a deep dive on flox. It's remarkable.