Model [geotiff] postprocessing at scale - what would you do?


Model output storage format for processing, analysis and visualisation.


Australia-wide models at 400m.
Call it 10000 x 9500 for simplicity, in pixels.
All models on the same grid and projection.
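For a rough sense of scale, the per-band numbers can be sketched as follows. This is back-of-envelope arithmetic only: the float32 pixel type and the 1000 x 950 chunk shape are assumptions for illustration, not details of the actual files.

```python
# Back-of-envelope sizes for one band on the grid described above.
# Assumption: float32 pixels (4 bytes each); the real dtype is not stated.
WIDTH, HEIGHT = 10_000, 9_500
BYTES_PER_PIXEL = 4  # assumed float32

band_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
stack_bytes = 44 * band_bytes  # one of the 44-band outputs, uncompressed

# Hypothetical chunk shape: each axis split into 10 equal parts.
CHUNK_X, CHUNK_Y = 1_000, 950
chunk_bytes = CHUNK_X * CHUNK_Y * BYTES_PER_PIXEL
chunks_per_band = (WIDTH // CHUNK_X) * (HEIGHT // CHUNK_Y)

print(f"one band, uncompressed: {band_bytes / 1e6:.0f} MB")
print(f"44 bands, uncompressed: {stack_bytes / 1e9:.2f} GB")
print(f"one chunk:              {chunk_bytes / 1e6:.1f} MB")
print(f"chunks per band:        {chunks_per_band}")
```

Compression and a different dtype would change these figures, so treat them as an upper-bound style estimate rather than the real on-disk sizes.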


Over 100 so far, and this will probably grow.


Currently geotiff.


Approximately 0.5 TB each.

Stored at s3://bucket/folder/time [time being, e.g., when the model was run]

Each model

Each of the 6 below has different data, even where the variable names are the same.

  1. 2 x 44-band outputs that will always be the same - the first set of variables [you could concatenate these, singly or together]
  2. 1 variable-band output, from 5 to 30 bands - a second set of variables, the 5 being a subset of the possible 30. [can’t concatenate]
  3. 2 groups of 44 variable-band outputs, 3 to 15 bands each - a third set of variables. Each of the 44 containers has the same name as in 1. [can’t concatenate]
  4. 1 group of variable-band outputs, 3 to 15 bands each - a third set of variables. [can’t concatenate]
  5. 16 variable-band outputs, from 5 to 30 bands each - same second set of variables, used in arriving at 1. and 2. [can’t concatenate]
  6. 44 x 4 variable-band outputs, from 5 to 30 bands each - same second set of variables, with names from 1, used in arriving at 1. and 2. [can’t concatenate]


Distributional type reductions - mean, max, etc. for users. Other more esoteric model analysis for me.
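As a minimal sketch of what such reductions look like per pixel, here is a tiny in-memory example. The array shape, the random data and the choice of the 90th percentile are all made up for illustration; the real data would be read lazily from storage rather than held in one NumPy array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in for a stack of model outputs with axes (model, y, x).
stack = rng.random((5, 4, 6)).astype("float32")
stack[0, 0, 0] = np.nan  # pretend one model has no data at one pixel

# Per-pixel reductions across the model axis, ignoring missing values.
mean_map = np.nanmean(stack, axis=0)
max_map = np.nanmax(stack, axis=0)
p90_map = np.nanpercentile(stack, 90, axis=0)

print(mean_map.shape, max_map.shape, p90_map.shape)
```

Each result is a single (y, x) map, which is the shape a user-facing summary layer would have.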


  1. What format to convert to
  2. How to break down
  3. How to combine - most efficiently, cost-effectively and usably.

I’m sure you know, but kerchunking geotiff would be something I’d like to see here, perhaps to compare against whatever conversion/rechunking you choose to try.

Thanks Martin… worth a look for sure. I had not read all of that discussion until now; if I follow, this would be a convert-to-COG case.

Trialling along the lines of Martin’s suggestion: “Creating a virtual zarr datacube from COG s3 objects” (Issue #125, cgohlke/tifffile, GitHub).

Or, how would you chunk this?

Dummy placeholder:

Some subset of ‘grapes’ and ‘model’ will have data at a given time, sometimes all of them.

Or the other way, which may be logistically more usable on an ad-hoc basis given the region and append options of to_zarr, is to have a ‘long’ dataset

with 1300-odd variables that are each basically just time, y, x - so all the same dimensions.
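The two layouts can be sketched on a toy grid like this. The labels (`grape_a`, `model_1`, and so on), the `__` naming scheme and the random data are all hypothetical; only the dimension names ‘grapes’, ‘model’, time, y, x come from the discussion above.

```python
import numpy as np
import xarray as xr

# Hypothetical labels and a tiny grid; the real grid is about 10000 x 9500.
grapes = ["grape_a", "grape_b"]
models = ["model_1", "model_2", "model_3"]
times = np.array(["2023-01-01", "2023-02-01"], dtype="datetime64[ns]")
ny, nx = 4, 5

rng = np.random.default_rng(0)
data = rng.random((len(grapes), len(models), len(times), ny, nx), dtype=np.float32)

# Layout A: one dense cube. Any (grapes, model, time) combination without
# data would have to be filled with NaN.
cube = xr.DataArray(
    data,
    dims=("grapes", "model", "time", "y", "x"),
    coords={"grapes": grapes, "model": models, "time": times},
    name="value",
)

# Layout B: the 'long' dataset, one (time, y, x) variable per (grapes, model) pair.
long_ds = xr.Dataset(
    {
        f"{g}__{m}": cube.sel(grapes=g, model=m, drop=True)
        for g in grapes
        for m in models
    }
)

# The same reduction, expressed against either layout.
mean_a = cube.sel(grapes="grape_a", model="model_1").mean("time")
mean_b = long_ds["grape_a__model_1"].mean("time")

print(len(long_ds.data_vars), long_ds["grape_a__model_1"].dims)
```

In the long layout a pair with no data is simply an absent variable rather than a NaN-filled slab, and each variable can be handled on its own, which may be part of why it feels more usable for ad-hoc additions.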

Anyway, the 1300-odd variable version is now a thing!