Best practices to go from 1000s of netcdf files to analyses on a HPC cluster?

Hey, @rabernat! Thanks for the ping! Yes, the PyReshaper is mine. I’m pretty much the sole author on it. It’s actually been a while since I’ve touched it. (It’s still Python 2.7!)

PyReshaper Design

The PyReshaper was designed to convert synoptic data (1 time-step of all variables in a single file) to time-series data (1 variable over all time-steps in a single file). In other words, it’s like a transpose over the “variable” (or “field”) and “time” “dimensions” (not really netCDF dimensions, just conceptually).
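To make that rearrangement concrete, here is a minimal sketch of the same idea using Xarray rather than the PyReshaper’s actual code; the file names and glob pattern are made up:

```python
import xarray as xr

# Open many time-slice ("history") files, each holding one time-step of all
# variables, as a single dataset concatenated along the time dimension.
ds = xr.open_mfdataset("history.*.nc", combine="by_coords")

# Write one time-series file per variable: all time-steps, single field.
for name in ds.data_vars:
    ds[[name]].to_netcdf(f"{name}.timeseries.nc")
```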

The algorithm is very simple: what my group referred to as “Simple Task Parallelism”. Each “Task” in this context is “the writing of a single time-series file”. Hence, using MPI, we parallelize over “output files”. In practice, it is what @rabernat referred to as the “pull” approach, above. Every MPI rank (“task”) opens and reads a single variable from every input file, and then writes the data to its own output file, as depicted in this image:

In this picture, each different color represents a different time-step, and each “slice” is a “time-slice” or “history” file containing multiple fields.
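A rough sketch of that per-rank “pull” pattern, assuming mpi4py and netCDF4-python, with hypothetical file names and variable list, and assuming the record (“time”) dimension comes first in each variable (this is illustrative, not the PyReshaper’s actual code):

```python
from mpi4py import MPI
import netCDF4
import glob

comm = MPI.COMM_WORLD
input_files = sorted(glob.glob("history.*.nc"))   # hypothetical input slices
variables = ["T", "U", "V", "PS"]                 # one output file per variable

# Each MPI rank "owns" a subset of the output (time-series) files.
for varname in variables[comm.rank::comm.size]:
    out = netCDF4.Dataset(f"{varname}.timeseries.nc", "w")
    out.createDimension("time", None)             # unlimited record dimension
    vout = None
    t = 0
    for path in input_files:
        # Every rank opens and reads from every input file (the "pull").
        with netCDF4.Dataset(path, "r") as nc:
            vin = nc.variables[varname]
            if vout is None:
                # Copy the non-record dimensions from the first input file.
                for dim in vin.dimensions[1:]:
                    out.createDimension(dim, len(nc.dimensions[dim]))
                vout = out.createVariable(varname, vin.dtype, vin.dimensions)
            nsteps = vin.shape[0]
            vout[t:t + nsteps] = vin[:]            # read whole array, append
            t += nsteps
    out.close()
```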

We talked about a more sophisticated algorithm, one that dedicated 1 MPI rank to each input file (i.e., open each input file only once) and 1 MPI rank to each output file, and then used MPI send/recvs to shuffle the data to its appropriate destination. But since we considered this an optimization step, and since premature optimization is the root of all evil, we decided to revisit this idea later.
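We never built it, but purely to illustrate the idea, a toy mpi4py sketch of that reader/writer shuffle might look something like this (the rank split, variable count, and array sizes are all made up):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n_readers = 4                              # hypothetical: first 4 ranks read
n_writers = comm.size - n_readers
n_variables = 8                            # hypothetical number of fields

if comm.rank < n_readers:
    # Each "reader" rank opens its single input file exactly once, then sends
    # every variable's data to the writer responsible for that variable.
    # (Dummy arrays stand in for data read from the reader's input file.)
    for var_index in range(n_variables):
        data = np.zeros((192, 288))        # placeholder for one field's slice
        dest = n_readers + (var_index % n_writers)
        comm.send((comm.rank, var_index, data), dest=dest, tag=var_index)
else:
    # Each "writer" rank receives its variables' slices from every reader and
    # would append them to its own time-series file.
    my_vars = [v for v in range(n_variables)
               if n_readers + (v % n_writers) == comm.rank]
    for _ in range(len(my_vars) * n_readers):
        slice_index, var_index, data = comm.recv(source=MPI.ANY_SOURCE,
                                                 tag=MPI.ANY_TAG)
        # ... write `data` at time position `slice_index` in this rank's file
```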

Interestingly, on HPC machines with a good parallel filesystem (PFS), we never had to implement this more advanced algorithm. The PFS dealt with the multiple simultaneous open+read+close of the same file beautifully.

By default, the PyReshaper reads the entire data array from the input file (i.e., a single “chunk”) and writes that to its appropriate time-series file. At some point, when requested by a user, I added the ability to read smaller chunks from the input file and write each chunk separately. That way memory could be controlled a bit.
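A hedged sketch of that chunked option (not the PyReshaper’s actual code; the chunk dimension and size here are assumptions):

```python
def copy_variable_chunked(vin, vout, out_offset=0, chunk_size=64):
    """Copy data from open netCDF variable `vin` to `vout` in slabs along the
    first dimension, so only one chunk is held in memory at a time."""
    n = vin.shape[0]
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        vout[out_offset + start:out_offset + stop] = vin[start:stop]
```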

Problems with the PyReshaper

We take advantage of (abuse?) the optimized performance of the filesystem to allow simultaneous opens+reads+closes of the same file. I’m not certain that object storage systems can do the same thing efficiently. Maybe that’s not a problem, but just a design choice to not over-engineer the solution.

We greatly depend on the low-level I/O libraries, in this case netCDF4 (originally, PyNIO). We have had problems with the I/O libraries keeping data in memory until the file is closed, even when flushed, which can lead to memory blow-out. We never implemented a fix for this, because the workaround was always to limit the number of input slices read per job (more input slices mean longer time-series files, and more data held in memory before the output file is closed). We also often had to run in inefficient configurations, such as only 4 MPI tasks on a 16-core node, to prevent memory blow-out. In this sense, the algorithm is fairly inflexible.

Also, the parallelism is over output files (i.e., variables in the input files). If there are not enough variables in the input files, then you cannot achieve much parallelism. We discussed implementing a more sophisticated algorithm to allow more general scaling, but again, we never found the need.

Xreshaper

When @andersy005 joined us here at NCAR, I told him about the PyReshaper and said that, had I written it a few years later, I would have built it on Xarray. So, as an early task for him, I asked him to create an Xarray “port” of the PyReshaper. It never got finished, mainly because there was no demand for it, but you can see the WIP version here:
