I am developing a spatially distributed hydrological model that can report output on an hourly time step. The output grids can be large in terms of rows and columns; 5000 rows x 5000 columns is not unusual. So for each hourly time step I need to write (or hold in memory) a 5000 x 5000 grid (dtype=float32). You can imagine this becomes an enormous amount of storage, in memory (not possible) or on disk, if the model is run for, say, 20 years: 20 * 365 * 24 = 175,200 grids of 5000 rows x 5000 columns. This gets even worse if I have 20 or more variables to report. Keeping everything in memory is not an option because it is exhausted before even one month is processed. On top of the hourly reporting, I would also like to aggregate the hourly grids to daily, monthly, and annual grids, and write those to netCDF as well.
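For concreteness, a quick back-of-the-envelope calculation of the storage involved (the numbers just follow from the grid size above):

```python
import numpy as np

rows, cols = 5000, 5000
bytes_per_grid = rows * cols * np.dtype("float32").itemsize  # 100 MB per grid
n_grids = 20 * 365 * 24                                      # 175,200 hourly grids
total_tb = bytes_per_grid * n_grids / 1e12
print(f"{bytes_per_grid / 1e6:.0f} MB per grid, ~{total_tb:.1f} TB per variable")
# -> 100 MB per grid, ~17.5 TB per variable (and 20x that with 20 variables)
```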
My first idea was to write each variable as an xr.Dataset to a netCDF file for each hourly time step, open the files again with xr.open_mfdataset once a month's worth had been written to disk, and then aggregate them to daily and monthly grids. However, this runs into out-of-memory issues as well, and writing the large datasets to netCDF takes a very long time.
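To make that concrete, here is a minimal sketch of the approach I tried; the variable name, paths, and time range are placeholders, and the zero grid stands in for real model output:

```python
import os
import numpy as np
import pandas as pd
import xarray as xr

os.makedirs("out", exist_ok=True)
times = pd.date_range("2000-01-01", periods=24 * 31, freq="h")  # one month of hours

# Write one netCDF file per hourly time step
for t in times:
    grid = np.zeros((5000, 5000), dtype="float32")  # stand-in for a model output grid
    ds = xr.Dataset(
        {"discharge": (("time", "y", "x"), grid[np.newaxis])},
        coords={"time": [t]},
    )
    ds.to_netcdf(f"out/discharge_{t:%Y%m%d%H}.nc")

# Reopen the month of files and aggregate -- this is where memory runs out
ds = xr.open_mfdataset("out/discharge_*.nc", combine="by_coords")
daily = ds["discharge"].resample(time="1D").mean()
daily.to_netcdf("out/discharge_daily.nc")
```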
Does anyone have any advice / suggestions on how to do this in a smart / proper way? That would be much appreciated.
Cheers,
Wilco