Model [geotiff] postprocessing at scale - what would you do?

THE PROBLEM

Choosing a storage format for model output that works for processing, analysis and visualisation.

SCALE

Australia-wide models at 400 m resolution.
Call it 10000 x 9500 pixels for simplicity.
All models on the same grid and projection.

NUMBER

Over 100 so far, and this will probably grow.

FORMAT

Currently geotiff.

DATASETS

Approximately 0.5 TB each.

Stored at s3://bucket/folder/time [i.e. time is when the model was run]

EACH MODEL

Each of the 6 outputs below has different data, even if the variable names are the same. (A sketch of opening one of these follows the list.)

  1. 2 outputs of 44 bands each that will always be the same - the first set of variables [you could concatenate these, singly or together]
  2. 1 output with a variable band count, from 5 to 30 - a second set of variables, the 5 being a subset of the possible 30. [can’t concatenate]
  3. 2 groups of 44 outputs with variable band counts, from 3 to 15 - a third set of variables. Each of the 44 containers has the same name as in 1. [can’t concatenate]
  4. 1 group of outputs with variable band counts, from 3 to 15 - the third set of variables. [can’t concatenate]
  5. 16 outputs with variable band counts, from 5 to 30 - the same second set of variables, used in arriving at 1. and 2. [can’t concatenate]
  6. 44 x 4 outputs with variable band counts, from 5 to 30 - the same second set of variables, with names from 1., used in arriving at 1. and 2. [can’t concatenate]
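For concreteness, a minimal sketch of how one of the 44-band outputs might be opened and split into named variables with rioxarray - the path and variable names here are placeholders, not the real ones:

```python
# Hedged sketch: open one 44-band GeoTIFF and promote its bands to
# named variables. Path and names are placeholders for illustration.
import rioxarray

da = rioxarray.open_rasterio("s3://bucket/folder/run-time/output_1.tif")
# da has dims ("band", "y", "x"); band is length 44 for this output.
band_names = [f"var_{i:02d}" for i in range(1, 45)]  # hypothetical names
ds = da.assign_coords(band=band_names).to_dataset(dim="band")
```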

DOWNSTREAM PRODUCTS

Distributional-type reductions - mean, max, etc. - for users. Other, more esoteric model analysis for me.
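For what it’s worth, once everything sits in one chunked store those reductions are cheap to express with dask-backed xarray; a sketch, with the store path and variable name assumed:

```python
# Sketch only: distributional reductions over a hypothetical
# consolidated zarr store. Path and variable name are assumptions.
import xarray as xr

ds = xr.open_zarr("s3://bucket/analysis/cube.zarr")
summary = xr.Dataset(
    {
        "mean": ds["some_variable"].mean(dim="time"),
        "max": ds["some_variable"].max(dim="time"),
    }
)
summary.to_zarr("s3://bucket/analysis/summary.zarr", mode="w")
```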

QUESTIONS

  1. What format to convert to?
  2. How to break it down?
  3. How to combine it - most efficiently/cost-effectively and useably?

I’m sure you know, but kerchunking geotiff would be something I’d like to see here, perhaps to compare against whatever conversion/rechunking you choose to try.
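For reference, a rough sketch of what kerchunking a single GeoTIFF could look like, assuming kerchunk’s tiff module copes with these files (untested against this data; the path is a placeholder):

```python
# Hedged sketch: build kerchunk references for one GeoTIFF on s3 and
# open it through the zarr engine, without converting the file itself.
import xarray as xr
from kerchunk.tiff import tiff_to_zarr

refs = tiff_to_zarr("s3://bucket/folder/run-time/output_1.tif")
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "s3"},
    },
)
```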

Thanks Martin… worth a look for sure. If I follow (I had not read all of that discussion until now), this would be a convert-to-COG case.
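If it goes the COG route, one way to convert is rio-cogeo; a minimal sketch (filenames are placeholders, and any tiled/compressed profile would do):

```python
# Sketch: rewrite an existing GeoTIFF as a COG so internal tiles can be
# fetched with HTTP range requests. Filenames are placeholders.
from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

cog_translate(
    "output_1.tif",               # hypothetical source GeoTIFF
    "output_1_cog.tif",           # destination COG
    cog_profiles.get("deflate"),  # tiled + DEFLATE-compressed profile
)
```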

Creating a virtual zarr datacube from COG s3 objects · Issue #125 · cgohlke/tifffile · GitHub … trialling along the lines of Martin’s suggestion

Or, how would you chunk this?

Dummy placeholder:

Some subset of ‘grapes’ and ‘model’ will have data at a given time, sometimes all of them.
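In lieu of the placeholder, a hedged guess at what the ‘wide’ layout and a starting chunking could look like - dimension names and sizes are assumptions pieced together from the description above, not the real dataset:

```python
# Sketch of a "wide" cube: grape and model as dimensions, NaN where a
# combination has no data at a given time. All names/sizes are guesses.
import dask.array as da
import numpy as np
import xarray as xr

ny, nx = 9500, 10000
data = da.full(
    (1, 44, 6, ny, nx), np.nan, dtype="float32",
    chunks=(1, 1, 1, 2048, 2048),  # one tile per (time, grape, model)
)
ds = xr.Dataset({"value": (("time", "grape", "model", "y", "x"), data)})
# At float32, a 2048 x 2048 spatial chunk is ~16 MB: small enough for
# s3 round trips, while a full-map read touches only ~25 chunks.
```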

Or the other way, which may be more useable for ad-hoc work given the region/append functions of to_zarr, is to have a ‘long’ dataset with 1300-odd variables that are basically just (time, y, x) - so all the same.
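A minimal sketch of that append pattern, assuming a store path and that each run has already been reshaped into flat (time, y, x) variables:

```python
# Sketch: append one run's worth of the ~1300 flat (time, y, x)
# variables to a "long" zarr store. Paths are placeholders.
import xarray as xr

run = xr.open_dataset("run.nc")  # hypothetical: one reshaped model run
run.to_zarr("s3://bucket/analysis/long.zarr", append_dim="time")
```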

Anyway, the 1300-odd-variable version is now a thing!