What's the best file format to choose for raster imagery and mask products?

Okay, it’s been some time since I’ve wanted to ask this question and get some feedback, and it has come up in recent discussions here, so let’s go!

On our imagery production projects (at CNES, the French space agency), we keep looping back to this question: which format should we write our products to? It basically comes down to CoG vs Zarr, with NetCDF sometimes in the mix. There is probably no single good answer…

Some advantages and drawbacks I have in mind so far (a quick reading-side comparison in code follows the list):

  • CoG: Well understood by the remote sensing community. One file per band, chunked inside each file. No parallel writes inside a file, but parallel reads. Overviews (nice for visualization).
  • Zarr: More general-purpose. One file per band and per chunk (too many files?). Parallel writes and reads. No overviews, or only by multiplying files?
  • Zipped Zarr: A solution to the too-many-files problem (which can be heavy on infrastructure), and also an answer to how to download a single product, but I feel it is not nice.
  • NetCDF: Well understood by many communities. One file per product. Parallel reading via kerchunk?
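To make the comparison concrete, here is a minimal reading-side sketch (paths and bucket names are made up, and it assumes the usual xarray/rioxarray/zarr/dask stack is installed):

```python
# A reading-side sketch only; paths, bucket names and variable layout are
# hypothetical placeholders.
import xarray as xr
import rioxarray  # registers the rasterio backend and the .rio accessor

# CoG: one band per file, read lazily block by block
b04 = rioxarray.open_rasterio("s3://bucket/product/B04.tif", chunks=True)

# Zarr: one store per product (or collection), chunked reads and writes
ds_zarr = xr.open_zarr("s3://bucket/product.zarr", consolidated=True)

# NetCDF: one file per product; kerchunk can later expose it as a virtual
# Zarr store for parallel cloud reads
ds_nc = xr.open_dataset("product.nc", chunks={})
```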

As said elsewhere, (zipped) Zarr will be the next Sentinel-2 format. I feel there is no strong consensus or obviously good choice in the field of file formats for a collection of remote sensing products.

Maybe GeoZarr will change the game here, or the Zarr v3 sub-chunking (sharding) I heard about?
Maybe we got it all wrong and in the future should think more broadly at the collection level, with a single Zarr store for all products?

8 Likes

I think there’s a categorical difference between Zarr and all the other formats you mentioned, and that difference relates to the conceptual difference between file and data.

In very simple terms, my conceptual description of files is that they are packaged, portable, complete, self-sufficient entities, but their “internals” need to be extracted in order to be read — and that’s when you get their data. Of the examples above, GeoTIFFs, NetCDFs, even zipped Zarrs are indeed conceptual files.

Zarr, however, is not. It resembles something closer to a consumption-ready data stream. Its internals still use files, but that is a functional property of the Zarr system and is typically abstracted away from the end user.

Since you’re at CNES, I assume that one of your use cases is sharing your products with the general public (?). The fact that you mentioned parallel reading a couple of times makes me think: are you exploring a data-as-a-service solution?

I get the impression that the scientific community is primarily a file-based culture. People expect files in order to build their datacube and go on with their processing — but seem to be a little reluctant to accept a ready-made datacube! Not that there are that many… perhaps it’s a vicious circle.

I’d say:

  • If you want a fail-safe and compatible solution that is not DaaS and can easily get a REST API on top, that would be STAC + COGs (see the sketch just after this list). You can’t go wrong with those two technologies combined. Also, why reject multiband TIFFs?
  • If you want to set up a DaaS, array data (i.e., Zarr) is the answer I think, perhaps with a middleware like Arraylake to take care of multi-chunking and IO.
  • It doesn’t seem to me that NetCDF offers something better than the two solutions above, especially if products exceed, say, 100 MB in size.
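For illustration, a minimal sketch of the STAC + COGs route could look like this (the catalog URL, collection id and asset key are placeholders, not a real endpoint):

```python
# A sketch of the STAC + COG path: search a catalog, then lazily open one
# COG asset. All names/URLs below are hypothetical placeholders.
import pystac_client
import rioxarray

catalog = pystac_client.Client.open("https://example.org/stac/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],      # hypothetical collection id
    bbox=[1.3, 43.5, 1.6, 43.7],
    datetime="2024-06-01/2024-06-30",
)
items = list(search.items())

# Open one asset lazily; a multiband TIFF simply shows up with a "band" dimension
href = items[0].assets["red"].href       # asset key depends on the catalog
da = rioxarray.open_rasterio(href, chunks=True)
```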
3 Likes

I think I mainly agree with your introduction. I’ll extend it a bit: Zarr is data, and so an entire dataset/collection can be too.

You are right, I thought afterwards that I should have explained the use cases, so here are the main ones for end users:

  • Downloading products. It is generally thought of as “one or several products”, like several Sentinel products (files). But I think it should be extended to downloading a geo-temporal zone of a collection (data). Maybe that’s already two use cases; the first is the most used (but maybe not the better one?).
  • Visualizing data/products. Google Maps/Earth style, without having to transform them. So basically the WMS protocol and the like. There can be other visualization functionalities, like time series of a given variable over a point/area…
  • Analyzing data, at scale, with a Cloud or HPC system close to the data (HPC at CNES): a Pangeo-style workflow on an entire time series over a big area (see [WIP] Add satellite image processing benchmark by jrbourbeau · Pull Request #1550 · coiled/benchmarks · GitHub typically). A minimal sketch of this use case follows below.
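To illustrate that last use case, a sketch of such an analysis could look like this (store path, variable name and coordinates are made up):

```python
# Sketch of the "analyze at scale" use case: the time series of a variable
# averaged over an area, computed lazily with Dask. Store, variable name and
# coordinate ranges are hypothetical.
import xarray as xr
from dask.distributed import Client

client = Client()  # or a dask-jobqueue / Coiled cluster on HPC or cloud

ds = xr.open_zarr("s3://bucket/collection.zarr", consolidated=True)

ts = (
    ds["reflectance"]                                   # hypothetical variable
    .sel(x=slice(300_000, 400_000), y=slice(4_900_000, 4_800_000))
    .mean(dim=["x", "y"])                               # one value per time step
)
ts = ts.compute()  # runs in parallel on the cluster
```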

Yes, I think that it’s more a habit or default. Some colleagues in research really like OpenEO-like APIs to build their datacubes as Zarr and then use the Pangeo approach to better analyze them. I think things like stackstac are also really powerful to abstract files away, as in the sketch below.
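For example, a stackstac sketch might look like this (catalog URL, bands and resolution are assumptions on my side):

```python
# Sketch of abstracting files away with stackstac: STAC items in, lazy xarray
# datacube out. Catalog URL, band names and resolution are placeholders.
import pystac_client
import stackstac

catalog = pystac_client.Client.open("https://example.org/stac/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[1.3, 43.5, 1.6, 43.7],
    datetime="2024-01-01/2024-12-31",
).item_collection()

cube = stackstac.stack(items, assets=["B04", "B08"], resolution=10)
# cube is a dask-backed DataArray with (time, band, y, x) dimensions
```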

I agree, but that’s not the choice of ESA apparently, which looks like zipped Zarr. And for multiband TIFFs, I don’t know, I’ve just never seen them, maybe for download optimization purposes?

The big advantage is one file for an entire product plus metadata. But I wouldn’t go this way for optical raster data or the like either.

2 Likes

Excellently put, I have discarded a draft reply I’d started. This is way better

2 Likes

The easiest solution for us (a startup providing a product based on satellite data) is to rely on STAC + COG. It’s easy to handle and, from my perspective, covers almost all the use cases.
Retrieving data (by selecting the relevant pixels) can be done using odc.stac and rioxarray.
Visualization can be achieved by relying on titiler, as well as by directly adding the URL into QGIS (which works great for quick visualization), and large-scale processing can easily be handled using coiled (or HPC).
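For the data-retrieval part, an odc.stac sketch might look like this (catalog URL, collection, bands and bbox are placeholders):

```python
# Sketch of pixel retrieval from a STAC + COG catalog with odc.stac.
# Collection, band names and bbox are hypothetical placeholders.
import odc.stac
import pystac_client

catalog = pystac_client.Client.open("https://example.org/stac/v1")
items = list(
    catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=[1.3, 43.5, 1.6, 43.7],
        datetime="2024-06",
    ).items()
)

ds = odc.stac.load(
    items,
    bands=["red", "nir"],
    bbox=[1.3, 43.5, 1.6, 43.7],
    resolution=10,
    chunks={},        # dask-backed, lazy loading
)
```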

However, we haven’t succeeded in making it work for one use case:
→ Retrieving the time series of all Sentinel-2 data over more than 60,000 points.
We used xvec (a pretty awesome library), but it was still too slow…
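For context, the underlying pattern is essentially a vectorized nearest-neighbour selection; a plain-xarray sketch of it (not the xvec API itself, with made-up store and coordinates) would be:

```python
# Sketch of vectorized point extraction with plain xarray: select the nearest
# pixel for each of N points in one call. Store, variable and coordinates are
# placeholders.
import numpy as np
import xarray as xr

ds = xr.open_zarr("s3://bucket/collection.zarr", consolidated=True)

# 60,000 point coordinates in the dataset's CRS (random placeholders here)
xs = xr.DataArray(np.random.uniform(300_000, 400_000, 60_000), dims="points")
ys = xr.DataArray(np.random.uniform(4_800_000, 4_900_000, 60_000), dims="points")

# One vectorized selection; the result has dimensions (time, points)
series = ds["reflectance"].sel(x=xs, y=ys, method="nearest").compute()
```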

I’m not sure if zarr would be a better candidate for this use case.

Regarding ML, I don’t know if batching COGs (with xbatch) would have the same capabilities as Zarr. I think not, but I’m not sure how many users will try it.
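Assuming “xbatch” refers to xbatcher, a rough sketch of that batching could be as follows; it operates on the xarray object, so the code is the same whether the data came from COGs or Zarr, and the real question is how fast the backing format can feed the batches:

```python
# Sketch of ML-style batching with xbatcher (assuming that is the library
# meant). Store, variable name and patch sizes are placeholders.
import xarray as xr
import xbatcher

ds = xr.open_zarr("s3://bucket/collection.zarr", consolidated=True)

bgen = xbatcher.BatchGenerator(
    ds["reflectance"],                 # hypothetical variable
    input_dims={"x": 256, "y": 256},   # spatial patch size per batch
)

for batch in bgen:
    ...  # each batch is a 256x256 patch ready to feed a model
```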

1 Like

@geynard

CoG: Well understood by remote sensing community. One file per band, chunked in each file. No parallel writes inside a file, parallel reading. Overviews (nice for visualizations).

odc-geo includes a parallel write method for generating COGs from Dask backed arrays.

https://odc-geo.readthedocs.io/en/latest/_api/odc.geo.cog.save_cog_with_dask.html#odc.geo.cog.save_cog_with_dask

It’s capable of writing large outputs directly to S3, and does so in a single pass over the data, including overview generation, while following all the internal file-structure constraints of the COG format. It is not limited to a single machine either: both compression and writing out to disk or S3 are done concurrently, taking advantage of the available compute in the cluster.
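A minimal usage sketch might look like this (input/output paths and the compression setting are placeholders; the input just needs to be a dask-backed DataArray that carries its georeferencing, e.g. from odc.stac.load or rioxarray):

```python
# Sketch of parallel COG writing with odc-geo; paths and options are
# placeholders, assuming a single-band, dask-backed, georeferenced input.
import rioxarray
import odc.geo.xr  # registers the .odc accessor used to find the geobox
from odc.geo.cog import save_cog_with_dask

xx = rioxarray.open_rasterio("s3://bucket/input.tif", chunks=True).squeeze("band")

# Compresses blocks and overviews with Dask and streams them to S3 in
# COG-compliant order via multi-part upload.
save_cog_with_dask(xx, "s3://my-bucket/output.tif", compression="deflate")
```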

4 Likes

Very interesting discussion! I guess you are aware that this question is currently being kicked around at ESA for what they call the EO Processing Framework (EOPF) for Copernicus. It seems that Zarr is the format of choice, but some implementation questions (e.g. pyramids) are still open, and I’m not sure to what extent zipping is applied; more info is here: Product Structure and Format Definition — EOPF - Core Python Modules. It would be extremely useful to receive as much feedback as possible, also from people outside the Copernicus bubble, on these imminent decisions, which will shape the way petabytes of Sentinel data will be thrown at us for the rest of this decade at least.
There will be a workshop at the end of November at ESA-ESRIN, which is probably one of the last occasions to provide feedback that can have an impact on this important scheme.

2 Likes

Do you mean that you are able to write one COG file in parallel on S3 or on a standard file system with Dask and Xarray? How does that work? This would change the conclusion of the EOPF definition trade-offs shared by @strobpr.

This has come up several times, so yes. At CNES we had a bit of contact several months/years ago, but we lost track of the progress here. Thanks for the link to the current version! I was really happy when I saw an earlier version of it, as I share most of the objectives of this new format. But the thing is, I currently don’t see any clear answer, at least for Sentinel-2 SAFE-like optical products.

The Zarr format stores each block in its own file and can be read and written with block-wise parallelisation. It is currently considered the default format for EOPF product items because it supports both reading and writing from cloud storage.
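A rough sketch of that block-wise parallel writing pattern with xarray (store path, variable name and sizes are placeholders):

```python
# Sketch of block-wise parallel Zarr writing: create the store layout first,
# then let independent workers fill disjoint regions.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"reflectance": (("time", "y", "x"), np.zeros((10, 1024, 1024), dtype="f4"))}
).chunk({"time": 1, "y": 512, "x": 512})

store = "s3://bucket/product.zarr"

# Write metadata and the chunk layout only, no array data yet
ds.to_zarr(store, mode="w", compute=False)

# Each worker/process can then write its own time slice independently
for t in range(ds.sizes["time"]):
    ds.isel(time=slice(t, t + 1)).to_zarr(store, region={"time": slice(t, t + 1)})
```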

The zipped Zarr is a container format for the multi-file Zarr that packages the Zarr files into a single uncompressed zip file. This can be transferred as one object, and it can also be stream-accessed in cloud storage (or local storage) for reading without unzipping. It is expected to be the preferred transfer format.
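Reading such a zipped store without unzipping it can be sketched like this with the zarr-python 2.x ZipStore (the file name is a placeholder; for object storage one would layer an fsspec zip mapping on top instead):

```python
# Sketch of reading a zipped Zarr store directly, without unzipping it.
import xarray as xr
import zarr

store = zarr.ZipStore("S2_product.zarr.zip", mode="r")
ds = xr.open_zarr(store)
# ... lazy reads work as with a normal directory store ...
store.close()
```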

Do you have more information about this event?

1 Like

I’ve heard about this project and rumors that it was going to use Zarr. What is the best way for our community (which contains the core developers of this technology) to provide such feedback?

2 Likes

Do you mean that you are able to write one COG file in parallel on S3 or on a standard file system with Dask and Xarray? How does that work?

Correct, odc-geo compresses every TIFF block separately. It also constructs Dask graphs for the overview images and compresses those too, although one can also supply overviews as independent inputs. The compressed blocks are then concatenated in the order required by the COG format and written out to the final location using the multi-part upload functionality of S3. Multi-part upload on AWS supports splitting any object into up to 10k ordered parts, provided each part is at least 5 MiB in size. There is some implementation complexity arising from the interaction of Dask’s laziness with the minimal part-size requirement, but it works without forcing evaluation order or forcing persist calls. It is possible to create COGs that are larger than the total available cluster memory, as compressed bytes are flushed to S3 as we go.
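For readers unfamiliar with the underlying S3 mechanism, a bare-bones multi-part upload looks roughly like this with boto3 (bucket, key and payloads are placeholders; odc-geo’s actual implementation is of course more involved):

```python
# Bare-bones sketch of S3 multi-part upload: ordered parts, each at least
# 5 MiB (except the last), up to 10,000 parts per object.
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "mosaic.tif"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []

# Stand-in for the stream of COG-ordered compressed blocks
payloads = [b"\x00" * (5 * 1024 * 1024), b"\x00" * 1024]

for number, data in enumerate(payloads, start=1):
    resp = s3.upload_part(
        Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
        PartNumber=number, Body=data,
    )
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```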

1 Like