Best practices to go from 1000s of netCDF files to analyses on an HPC cluster?

@rabernat Thank you for checking in. I want to get back to this issue as soon as possible, but some other projects have taken precedence for now. I appreciate the community effort that went into Rechunker. I’ll let you know as soon as I get back to it!

By the way, I just ran across this PyWren example: pywren-workshops/Landsat_NDVI_Timeseries.ipynb at master · aws-samples/pywren-workshops · GitHub

I’ve been playing around with using Apache Beam for these sorts of rechunking tasks recently. It seems to work fairly well.
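
For context, here’s a minimal sketch of what such a pipeline can look like. Everything concrete in it is hypothetical: the store paths, dimension names, chunk shapes, and itemsize are made up, and the transforms reflect my reading of the Xarray-Beam API (`DatasetToChunks`, `Rechunk`, `ChunksToZarr`) rather than a stable interface:

```python
# Hypothetical sketch of a Beam rechunking pipeline built on Xarray-Beam.
# All paths, dimension names, and chunk sizes here are made up.
import apache_beam as beam
import xarray
import xarray_beam as xbeam

ds = xarray.open_zarr('source.zarr', chunks=None)
source_chunks = {'time': 1, 'lat': 720, 'lon': 1440}  # one spatial map per chunk
target_chunks = {'time': 8760, 'lat': 10, 'lon': 10}  # long time series per chunk

with beam.Pipeline() as p:
    (
        p
        | xbeam.DatasetToChunks(ds, source_chunks)
        # Shuffle data into the target layout via bounded-size intermediate
        # chunks rather than materializing the full array on any one worker.
        | xbeam.Rechunk(dict(ds.sizes), source_chunks, target_chunks, itemsize=4)
        | xbeam.ChunksToZarr('target.zarr', zarr_chunks=target_chunks)
    )
```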

One thing that has come up is the need for “multi-stage” rechunking, where more than one set of intermediates is written out to disk. This avoids having to write very small chunks of data. Unfortunately, it does seem to require irregular chunks for the temporary arrays.
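
To make the multi-stage idea concrete, here is a toy, self-contained sketch of one way intermediate chunk sizes can be spaced geometrically between the source and target along a single dimension. The numbers are made up, and this is only my illustration of the motivation, not the algorithm in the pull request below:

```python
# Toy illustration (not Rechunker's actual algorithm): with multiple stages,
# intermediate chunk sizes can be spaced geometrically between source and
# target, so no single stage has to read or write very small chunks.
def stage_chunk_sizes(source: int, target: int, num_stages: int) -> list[int]:
    """Chunk size along one dimension for the source, each stage, and target."""
    ratio = (target / source) ** (1.0 / num_stages)
    return [max(1, round(source * ratio**i)) for i in range(num_stages + 1)]

# Going from per-timestep chunks (size 1) to full-series chunks (size 10_000):
print(stage_chunk_sizes(1, 10_000, 1))  # [1, 10000]: direct, one huge jump
print(stage_chunk_sizes(1, 10_000, 2))  # [1, 100, 10000]: one intermediate
print(stage_chunk_sizes(1, 10_000, 4))  # [1, 10, 100, 1000, 10000]
```

Note that sizes like these generally won’t align evenly with both the source and target chunk boundaries, which hints at why the temporary arrays end up needing irregular chunks.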

If anyone is curious, you can find most of my current progress in this Rechunker pull request:

and the separate “Xarray-Beam” package:
