What's the best file format to chose for raster imagery and masks products

geynard · October 1, 2024, 7:00am

Okay, it’s been some time I wanted to ask this question and get some feedbacks, and this has come up in recent discussions here, so let’s go!

On our imagery production projects (at CNES, French space agency), we often loop on this question: to which format should we write our products? This is basically CoG vs Zarr, with sometimes NetCDF. There is probably no good unique answer…

Some advantages and drawbacks I have in mind so far:

CoG: Well understood by remote sensing community. One file per band, chunked in each file. No parallel writes inside a file, parallel reading. Overviews (nice for visualizations).
Zarr: More all around. One file per band and chunk (too many files?). Parallel writes and reads. No overviews or by multiplying files?
Zipped Zarr: Solution to the too many files problem (that can be heavy on infrastructure), solution to how to download one product too, but I feel it is not nice.
NetCDF: Well understood by many communities. One file per product. Parallel reading using kerchunk?

As said elswhere, (Zipped) Zarr will be the next Sentinel 2 format. I feel there is no strong consensus and good choice on the field of file format for a collection of remote sensing products.

Maybe GeoZarr will change the game here, or Zarr v3 with sub chunks I heard about?
Maybe we got it all wrong and in the future should think more broadly as a collection, a unique Zarr store for every products?

sotosoul · October 1, 2024, 6:23pm

I think there’s a categorical difference between Zarr and all the other formats you mentioned, and that difference relates to the conceptual difference between file and data.

In very simple terms, my conceptual description of files is packaged, portable, and complete, self-sufficient entities, but their “internals” need to be extracted in order to be read — and that’s when you get their data. Of the examples above, GeoTIFFs, NetCDFs, even zipped Zarrs are indeed conceptual files.

Zarr, however, is not. It resembles something close to a consumption-ready data stream. Its internals still use files but this is a functional property of the Zarr system and it’s typically abstracted from the end user.

Since you’re CNES, I assume that one of your use cases is sharing your products with the general public (?). The fact that you mentioned parallel reading a couple of times makes me thinking, are you exploring a data-as-a-service solution?

I get the impression that the scientific community is primarily a file-based culture. People expect files in order to build their datacube and go on with their processing — but seem to be a little reluctant to accept a ready-made datacube! Not that there are that many… perhaps it’s a vicious circle.

I’d say:

If you want a fail-safe and compatible solution that is not DaaS and can easily get a REST on top, that would be STAC + COGs. You can’t go wrong with those two technologies combined. Also, why rejecting multiband tiffs?
If you want to set up a DaaS, array-data (i.e., Zarr) is the answer I think, perhaps with a middleware like Arraylake to take care of multichunking and IO.
It doesn’t seem to me that NetCDF offers something better than those two solutions above, especially if they are exceeding, say, 100MB in size.

geynard · October 1, 2024, 6:47pm

I think I mainly agree with your introduction. I’ll extend it a bit, Zarr is data, and so can be an entire dataset/collection.

You are right, I thought afterwards that I should have explained the use cases, so here are the main ones for end users:

Downloading products. It is generaly thought as “one or several products”, like several Sentinel products (files). But I think it should be extended as download a geo-temporal zone of a collection (data). Maybe that’s already two use cases, first one is the most used (but not the better ?).
Visualize data/products. Well, Google Map/Earth like, without having to transform them. So basically WMS protocol and the likes. There can be other visualization functionnalities, like time series over a point/area of a given variable…
Analyze data, at scale. With a Cloud or HPC system close to the data (HPC at CNES), Pangeo style workflow on an entire time serie on a big area (see [WIP] Add satellite image processing benchmark by jrbourbeau · Pull Request #1550 · coiled/benchmarks · GitHub typically).

Yes, I think that it’s more an habit or default. Some colleagues in research really like OpenEO like API to build their datacubes as Zarr and then using Pangeo approach to better analyze them. I think things like stackstac are also really powerfull to abstract files.

I agree, but not the choice of ESA apparently, which looks like zipped Zarr. And for multiband tiffs, I don’t know, I’ve just never seen it, maybe for download optimization purpose?

Big advantage is one file for an entire product and metadata. But I wouldn’t go this way for optical raster data or alike neither.

Michael_Sumner · October 1, 2024, 11:12pm

Excellently put, I have discarded a draft reply I’d started. This is way better

Basile_Goussard · October 2, 2024, 3:13pm

The easiest solution for us (a startup providing a product based on satellite data) is to rely on STAC + COG. It’s easy to handle and, from my perspective, covers almost all the use cases.
Retrieving data (by selecting the relevant pixels) can be done using odc.stac and rioxarray.
Visualization can be achieved by relying on titiler, as well as by directly adding the URL into QGIS (which works great for quick visualization), and large-scale processing can easily be handled using coiled (or HPC).

However, we haven’t succeeded in making it work for one use case:
→ Retrieving the time series of all Sentinel-2 data over more than 60,000 points.
We used xvec (a pretty awesome library), but it was still too slow…

I’m not sure if zarr would be a better candidate for this use case.

Regarding ML, I don’t know if batching COG (with xbatch) would have the same capabilities as zarr. I think not, but I’m not sure how many users will try it.

kirill.kzb · October 2, 2024, 11:31pm

@geynard

CoG: Well understood by remote sensing community. One file per band, chunked in each file. No parallel writes inside a file, parallel reading. Overviews (nice for visualizations).

odc-geo includes a parallel write method for generating COGs from Dask backed arrays.

https://odc-geo.readthedocs.io/en/latest/_api/odc.geo.cog.save_cog_with_dask.html#odc.geo.cog.save_cog_with_dask

It’s capable of writing large outputs directly to S3, and does so in a single pass over data, including overview generation while following all the internal file structure constraints of the COG. It is not limited to a single machine either, both compression and writing out to disk or S3 are done concurrently taking advantage of available compute in the cluster.

strobpr · October 4, 2024, 1:58pm

Very interesting discussion! I guess you are aware that this question is currently kicked around at ESA for their what they call the EO processing Framework (EOPF) for Copernicus. It seems that Zarr is the format of choice but some implementation questions (e.g. pyramids) are still open, not sure to what extent zipping is applied, more info is here: Product Structure and Format Definition — EOPF - Core Python Modules. It would be extremely useful to receive as much feed-back as possible also from people outside the Copernicus bubble on these imminent decisions which will shape the way Petabytes of Sentinel Data will be thrown at us for the rest of this decade at least.
There will be a workshop end of November at ESA-ESRIN which is probably one of the last occasions to provide feed-back which has an impact on this important scheme.

geynard · October 4, 2024, 8:29pm

Do you mean that you are able to write one COG file in parallel on S3 or on a standard file system with Dask and Xarray? How does that work? This would change the conclusion of the EOPF definition trade-offs shared by @strobpr.

This has come up several times, so yes. At CNES we had a bit of contact several months/years ago, but we lost track of the advances here. Thanks or the link to the current version! I was really happy at seing an earlier version of it, as I share most of the objectives of this new format. But the thing is I currently don’t see any clear answer, at least for Sentinel 2 SAFE like optical products.

The zarr format stores a block in a file and can be read and written with block-wise parallelisation. It is currently considered the default format for EOPF product items because it supports both reading and writing from cloud storage.

The zipped zarr is a container format for the multi-file zarr that packages the zarr files into a single non-compressed zip file. This can be transferred, and it can also be stream-accessed in cloud storage (or local storage) for reading without unzipping. It is expected to be a preferred transfer format.

Do you have more information about this event?

rabernat · October 4, 2024, 8:54pm

I’ve heard about this project and rumors that it was going to use Zarr. What is the best way for our community (which contains the core developers of this technology) to provide such feedback?

kirill.kzb · October 7, 2024, 5:33am

Do you mean that you are able to write one COG file in parallel on S3 or on a standard file system with Dask and Xarray? How does that work?

Correct, odc-geo compresses every TIFF block separately. It also constructs Dask graphs for overview images and compresses those too, although one can also supply overviews as independent inputs. Compressed blocks are then concatenated in the order required by the COG format and written out to a final location using multi-part upload functionality of S3. Multi-part upload on AWS supports splitting any object into up to 10k ordered blocks, provided each block is at least 5MiB in size. There is some implementation complexity coming from the interaction of laziness of Dask with minimal block size requirement, but it works without forcing evaluation order or forcing persist calls. It is possible to create COGs that are larger than total available cluster memory, as we are flushing compressed bytes to S3 as we go.

strobpr · November 3, 2024, 10:40am

Apologies for the late reply! I copy below a mail I received from ESA (Jolyon.Martin@esa.int). Some info should also be available here: https://eoframework.esa.int/x/yYEWAQ (link didn’t work when just tested - hope this is temporary). Contact Jolyon or Angela.Lombardi@ext.esa.int for latest info.

Dear All

At the workshop on STAC we held before the summer https://eoframework.esa.int/display/SSR/STAC+Workshop±+June+6%2C+2024 we suggested a follow on to be arranged in the autumn. We would therefore like to propose a “save the date” for Tuesday 26 November for this event, we did a preliminary check that there are no major Copernicus happenings at this time so I hope this works well.

From an ESA perspective we would be pleased to present the progress on STAC for the Sentinels data, and the extended use of the STAC within the various elements of the Copernicus Ground Segment. We would also take the opportunity to share a more in-depth look at the work being performed by ESA in preparation of a new common product format to be deployed across the Sentinels, and Copernicus Expansion missions in the future, using ZARR.

As a very rough planning for this second workshop I would propose the morning dedicated more to STAC and the afternoon more on ZARR. For ESA’s part we will send out an information package to allow you all to take a look at the work performed to date and bring any questions, suggestions or comments. Potentially we would also look at hosting a STAC sprint next year, and so some discussion on that at the workshop would be welcome.

The information shared at the last workshop was very welcome, and perhaps this time there would be specific themes that the attendees would like to focus. Please let me know if there is any specific topic you would like to present or discuss so that we can compile a more detailed agenda, we would welcome any input/presentation on the STAC and ZARR ecosystem.

I am using the email list for those informed of the last workshop, please feel free to forward this notice to whomever you feel could be interested. Due to the building work ongoing in ESRIN we are a bit limited in hosting people on-site, so this time the meeting would be teams only. Please let Angela know if you would like to be included or dropped from the email list.

Kind regards

Jolyon Martin

RichardScottOZ · November 6, 2024, 10:05pm

Zarr zip driver is slower too when I have tried it.

I am considering recommendations on this for proprietary format conversion as well - geophysics grids as such, but there will be other types.

Some of the similar things I have been writing on confluence.

Geoscientists understand geotiffs - at least at a basic level, not a data level - and can use directly.

nunomir · November 21, 2024, 6:37am

It seems that the zarr + stac workshop will be postponed to next year. Likely in February.

csaybar · January 5, 2025, 12:01pm

Unfortunately, I’m unable to add links directly as I’m a new user, I don’t have the necessary privileges. However, I’ve outlined below the reasons why I believe Zarr+ZIP is not a good idea:

I hope this helps reconsider some of the decisions regarding the use of the ZIP format. It explains why the ZIP format will always be slow.

Michael_Sumner · February 27, 2025, 4:06am

excellent work, and sorry to have missed this … 100% agree this is a disaster and I hope others are speaking up against it

emmanuelmathot · March 5, 2025, 5:33pm

I just spent the last two days at a workshop organized by ESA, where the roadmap for Zarr adoption within the EOPF framework was presented. Below, I will summarize the key takeaways relevant to this discussion until official information is officially released in April:

EOPF (Earth Observation Processing Framework) is the abstraction framework for the new Copernicus Space Component (CSC) Data Processor Re-engineering. It serves as a data model that abstracts product items such as measurements, quality, annotations, and attributes. In theory, this framework would allow for encoding products in multiple containers, including Zarr, GeoTIFF, NetCDF, and SAFE.

The selected baseline data encoding for all missions is Zarr, chosen for various justified reasons such as scalability and cost efficiency. The definition of the specifications initiative began in 2024 and will continue until 2026, when ESA is expected to start operationally producing the new Zarr from the ground segment.

By that time, iterations will need to be made regarding the Zarr specification and partitioning. A key point confirmed by all ESA technical officers is that there will be no Zipped Zarr in the final implementation . The Zipped Zarr format is only a temporary workaround for managing the enormous number of files generated by the combination of the EOPF data model, partitioning, and Zarr structure for a single data product. This is where contributions like sharding will be welcome.

The Sample Service, developed and operated by EODC, will serve as the central point of reference (official opening by April) for all community activities around EOPF Zarr. More information on the Sample Service can be found here: https://zarr.eopf.copernicus.eu/

Additionally, there will be a STAC + Zarr workshop on April 17th. More information to come.

ESA also aims to implement a Discrete Global Grid System (DGGS) for data mapping to reduce current data duplication, which can be as high as 30% due to Sentinel-2’s current UTM gridding.

It is important to note that this initiative is still underway, and ESA has several key matters to address in collaboration with the community before they can begin producing Sentinel data in Zarr on a permanent basis. The key takeaway is that ESA’s current focus is on promoting community engagement and encouraging user adoption, which is a welcomed effort.

c.defranchis · March 9, 2025, 6:41pm

Hi Emmanuel,

Thank you for sharing these takeaways! Do you know if the roadmap includes a plan for reprocessing the entire Sentinel-1/2/3 archives to the new format?

Christoph_Reimer · March 12, 2025, 9:30pm

I’m happy to announce that the first webinar related to the EOPF Zarr data format will take place on the 27 March 2025 at 11:00 CET. Register now via the given Webinar Registration - Zoom link. Join the webinar and contribute to the specification of this new format. Looking forward to a fruitful discussion.

tinaok · March 14, 2025, 5:17am

Hi c.defrancis,
As I understand, eventually yes, all data will be converted.

tinaok · April 11, 2025, 8:43am

Hi everyone,
(may be we should start a new thread specific to updates on EOPF??, but I’ll put this info here;)

The STAC & Zarr Workshop will be held next Wednesday 16.04.25 10:00-16:00 (CEST) It is a online workshop, registration link is;

https://forms.oXice.com/e/JRpNqA384D

This is what I got as draft agenda ;
Agenda – STAC & Zarr Workshop – draft

Morning 10:00 to 12:00
10:00 – 10:10 - Introduction - ESA
10:10 – 10:25 – STAC in the Copernicus Data Space Ecosystem - CF
10:25 – 10:40 - openEO interface using PG STAC - Sinergise
10:40 – 10:55 - Context of Zarr in EOPF re-engineering - ESA
10:55 – 11:05 - STAC in ESA EOPF Zarr - CS
11:05 – 11:20 - EOPF Sample Service - EODC
11:20 – 11:30 - STAC Asset definition for Zarr - DLR
11:30 – 11:40 - STAC for EOPF orchestration – Airbus
11:40 – 12:00 – Open Discussion
Afternoon 14:00 to 16:00
14:00 – 14:15 - ECMWF Data Stores approach to STAC/Zarr - ECMWF
14:00 – 14:30 - Zarr for Sentinel-3 Marine Data - EUMETSAT
14:30 – 14:45 - STAC in ESA Common Services (TBC) – SPACEBAR
14:45 – 15:00 - Transforming research and operational pipelines: STAC at BEYOND - Νational Οbservatory of Αthens
15:00 – 15:15 – STAC & Zarr for Climate Data Cache on DESP – B-Open
15:30 – 15:45 - GeoZarr – Dev. Seed
15:45 – 16:00 – Open Discussion and Conclusion

Topic		Replies	Views
OPeNDAP vs. direct file access Data	32	4414	January 27, 2021
Cloud-optimized access to Sentinel-2 JPEG2000 Data	10	355	May 2, 2025
Welcome, I need some support for the design of a forecast archive with Zarr Data	10	1155	April 23, 2022
STAC and Earth Systems datasets Data	23	4936	October 24, 2022
ZCollection: a library for partitioning Zarr dataset Data	14	1318	April 11, 2022

What's the best file format to chose for raster imagery and masks products

Apologies for the late reply! I copy below a mail I received from ESA (Jolyon.Martin@esa.int). Some info should also be available here: https://eoframework.esa.int/x/yYEWAQ (link didn’t work when just tested - hope this is temporary). Contact Jolyon or Angela.Lombardi@ext.esa.int for latest info.

Related topics