How to efficiently merge 500 Zarr files with overlapping coordinates using Dask on S3?

G’Day Pangeo Community, I’m hoping for some suggestions about how to efficiently combine approx 500 zarr files into a single DataArray, so I can then feed that DataArray into odc.geo.xr.write_cog.

I’ve tried a number of Xarray methods that either result in an error or are slow, and therefore seem to be heading down the wrong track. For example, combine_by_coords throws errors due to the overlapping x and y coordinates between Zarr files, and concat doesn’t seem to work in a distributed workflow the way open_mfdataset does.

A lot of my workflows use coiled.io to handle cloud resources and Dask to distribute the computation, so I’m hoping to do this process in a distributed workflow to speed things up.

Below is an example of one of my individual Zarr files; they all have a similar structure. A number of the Zarr files have overlapping boundaries that result in duplicate x, y and z values.

<xarray.Dataset> Size: 24MB
Dimensions:      (z: 1, y: 4181, x: 1409)
Coordinates:
    spatial_ref  int64 8B ...
  * x            (x) float64 11kB 141.0 141.0 141.0 141.0 ... 141.0 141.0 141.0
  * y            (y) float64 33kB -40.99 -40.99 -40.99 ... -41.16 -41.16 -41.16
  * z            (z) int64 8B 1
Data variables:
    bathymetry   (z, y, x) float32 24MB dask.array<chunksize=(1, 512, 512), meta=np.ndarray>

Any help would be amazing! I’m still learning the ropes with this kind of distributed geospatial programming.

Cheers,
Thomas
BathyMaps Australia

I know this is not what you asked for, but I would cast every source as a regular raster for GDAL (you might need some ‘vrt://{address.zarr}?a_srs=EPSG:4326’ kind of augmenting) and pass them all as one list to osgeo.gdal.Warp("mywarp.vrt", [list, of, sources]). Then investigate the bbox, resolution and CRS of that resulting virtual file, tweak those as arguments to the Warp call as you like (pixel resolution, pixel alignment, actual bbox extent and details of the target CRS are all candidates), and then rerun the same conversion with COG driver options rather than VRT, plus warp process options.
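Roughly, that workflow sketched in Python (untested; the file names and source list are placeholders, and the vrt:// wrapping is only needed if the CRS isn’t picked up automatically):

from osgeo import gdal

gdal.UseExceptions()

# Placeholder source list -- substitute your ~500 Zarr stores, optionally
# wrapped as 'vrt://.../tile.zarr?a_srs=EPSG:4326' if the CRS needs forcing.
sources = ["tiles/tile_0001.zarr", "tiles/tile_0002.zarr"]

# Step 1: warp into a VRT -- no pixels are written, so you can cheaply
# inspect the derived extent, resolution and CRS before committing.
vrt = gdal.Warp("mywarp.vrt", sources, format="VRT")
print(gdal.Info(vrt))
vrt = None  # close the dataset

# Step 2: once the grid looks right, rerun with the COG driver, optionally
# pinning the grid explicitly (xRes/yRes/outputBounds/dstSRS).
gdal.Warp(
    "mosaic.tif",
    sources,
    format="COG",
    creationOptions=["COMPRESS=ZSTD", "BIGTIFF=YES"],
)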

I can’t imagine any other Zarr wrangling being much fun. Happy to flesh this out if it’s of interest. 500 is not very many; is 1409 x 4181 representative of the entire set? Is there an underlying alignment, or is that all arbitrary per source? Does overlapping mean warping is really required, or is this (just) a case of mosaicking sources that already line up abstractly?

How are these overlapping? Is there some structure to the overlapping you can exploit? And importantly, how are you deduplicating the data at overlapping points?

Hey @Michael_Sumner and @dcherian, thanks for the replies, guys.

The output will be a mosaic. I was using the term merge, but Michael’s prompt is correct: a mosaic is the better term to use.

You can see from the image below that the overlap is set to 0.1 degrees on all sides to help prevent any missing tiled regions in our apps, which allow users to download subset regions for offline use. However, our webmap performs better when the regions are mosaicked and the browser requests tiles from a master file rather than from many subset regions.

The overlapping duplicate data should theoretically be the same, so I can simply apply a “first” approach to eliminate duplicates.
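For example, something like rioxarray’s merge wrapper (which passes method="first" down to rasterio.merge) might express that rule, though it realises pixels in memory rather than staying lazy; the paths and CRS below are placeholders:

import xarray as xr
import rioxarray  # registers the .rio accessor used below
from rioxarray.merge import merge_arrays

# Placeholder paths and CRS -- substitute the real stores / projection.
paths = ["tiles/tile_0001.zarr", "tiles/tile_0002.zarr"]

tiles = []
for p in paths:
    da = xr.open_zarr(p)["bathymetry"].squeeze("z", drop=True)
    tiles.append(da.rio.write_crs("EPSG:4326"))

# rasterio's merge keeps the first value seen at each overlapping pixel;
# note this loads pixels into memory, so 500 tiles may need batching.
mosaic = merge_arrays(tiles, method="first")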

I actually took Michael’s suggestion and applied it to GeoTIFF files to create a mosaic, but I’m still interested in getting this working with my raw Zarr files.

It will work fine with Zarr, local or remote, URLs (vsicurl), S3 (vsis3) or whatever (vsi-x); you just might need to augment the GDAL Open() with some light VRT (vrt:// is the easiest way to do this). The ZARR driver analyses the coordinate arrays and will generate a regular-grid transform from those if that’s the situation, or use the coordinate arrays in a warp operation.

But, since they do form a mosaic, and I assume they’re all the same CRS and resolution (hence no need to warp, and you can ignore the CRS), perhaps running gdalwarp directly over all the sources is the easiest.

Using the CLI is a good way to test a few inputs on a draft output, but all of this is available via the API (in C++, osgeo.gdal in Python, gdalraster in R, etc.).


Note that with ZARR you really need careful construction of the dsn:

"ZARR:\"/vsicurl/https://path/to/dir.zarr\""

and strictly, because Zarr is a container format, probably the subdataset explicitly as well, with a “:sdsname” appended; but info on the above will list what’s available. There are always subtleties between drivers (netCDF for example has workarounds for the quoting that arguably are not appropriate).
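For example (placeholder URL, with the array name taken from the repr earlier in the thread), something like:

from osgeo import gdal

gdal.UseExceptions()

# Container-level dsn: ZARR driver prefix plus a /vsicurl/ (or /vsis3/) path.
dsn = 'ZARR:"/vsicurl/https://path/to/dir.zarr"'

# Info on the container lists the subdatasets (arrays) that are available.
print(gdal.Info(dsn))

# Then open one array explicitly by appending its subdataset name,
# e.g. the bathymetry variable from the repr above.
ds = gdal.Open('ZARR:"/vsicurl/https://path/to/dir.zarr":/bathymetry')
print(ds.RasterXSize, ds.RasterYSize)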

Auth is set with config arguments. As always, I’m happy to explore, offline if need be.

The parallelization aspect can be determined once the basics of interpretation are sorted out. Write to VRT first as a default so you can check the output lazily, before committing to pixel churn.


The overlapping duplicate data should theoretically be the same, so I can simply apply a “first” approach to eliminate duplicates.

Alternatively, could you crop each individual tile to eliminate overlaps, and simply concatenate together? This is a far simpler mosaic to construct.
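A rough sketch of that, assuming the 0.1-degree buffer mentioned above falls exactly on grid cells shared by every tile and all tiles sit on one common grid (paths are placeholders):

import xarray as xr

def crop_overlap(ds, buffer=0.1):
    # Trim the 0.1-degree buffer off every edge so neighbouring tiles abut
    # instead of duplicating coordinates.
    xmin, xmax = float(ds.x.min()), float(ds.x.max())
    ymin, ymax = float(ds.y.min()), float(ds.y.max())
    # y is stored descending in these tiles, so the slice runs high -> low.
    return ds.sel(
        x=slice(xmin + buffer, xmax - buffer),
        y=slice(ymax - buffer, ymin + buffer),
    )

paths = ["s3://my-bucket/tile_0001.zarr", "s3://my-bucket/tile_0002.zarr"]  # placeholders
tiles = [crop_overlap(xr.open_zarr(p)) for p in paths]

# With the overlaps gone the coordinate ranges no longer clash, so
# combine_by_coords can assemble the mosaic lazily from the dask-backed tiles.
mosaic = xr.combine_by_coords(tiles, combine_attrs="override")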

I created overlapping tiles from GEBCO 2024 as 435 Zarr datasets (using R here, fyi); each full tile is 3552x3552, and some are slightly smaller depending on the dangle.

The tiling and tile-dangle look like this, without the overlap (red is the data extent, grey boxes are the tiles):

Then I constructed a compressed COG with the warper (I will modify this to use the API in R and Python). With 32 CPUs it took about 2 min (4.4 GB COG, ZSTD compressed, 512x512 blocks):


gdalwarp --config GDAL_NUM_THREADS=ALL_CPUS zarr/*.zarr warp.tif -te -180 -90 180 90 -ts 86400 43200 -t_srs EPSG:4326 -co BIGTIFF=YES -co COMPRESS=ZSTD

I’ve set the target grid specification explicitly rather than having it derived from the Zarr coordinates. That’s perhaps not necessary, just good practice to get tight and accurate georeferencing in the COG.
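Roughly the same call via osgeo.gdal, untested, with the output name arbitrary:

import glob
from osgeo import gdal

gdal.UseExceptions()
gdal.SetConfigOption("GDAL_NUM_THREADS", "ALL_CPUS")

gdal.Warp(
    "warp.tif",
    sorted(glob.glob("zarr/*.zarr")),
    outputBounds=(-180, -90, 180, 90),  # -te
    width=86400,                        # -ts
    height=43200,
    dstSRS="EPSG:4326",                 # -t_srs
    creationOptions=["BIGTIFF=YES", "COMPRESS=ZSTD"],
)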

When there are overlapping pixels it’s “last in wins”; this could be nuanced by analysing the tiles as they are inserted into the COG, or by cropping as Deepak suggests. The warp process can be unpacked and done in a more modular way if that’s needed.

I know this is OT because there’s no Dask etc., but it was a fun exercise for me, and more efficient than I expected. In R we’re using mirai for async processing; in GDAL warp it’s internal to that library. If any of it is of interest I’m happy to follow up.

Hey @Michael_Sumner and @dcherian, thanks for your responses. @Michael_Sumner, your responses are so detailed, thank you! I’m going to circle back to this work soon and test out the suggestions. I will report back and let you know how I go, guys.