Cloud-optimized access to Sentinel-2 JPEG2000

Dear Pangeo community,

As you may know, Sentinel-2 rasters are distributed as JPEG2000.
If you try to crop an S2 raster remotely, you’ll find that a lot of requests are issued, and thus the format can be quickly classified as “not cloud native”, in contrast to COG, Zarr, etc.
This might be why the format was not even mentioned in the recent discussion about raster file formats.

A weak definition of “cloud optimized” for imagery rasters could be:

  1. individual tile access, to allow partial read,
  2. easy localization of any tile, to issue a single range request to retrieve it.
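Under this definition, a partial read boils down to a single HTTP Range request per tile. A minimal sketch (the function name and the offset-table layout are hypothetical; the offsets would come from TIFF's TileOffsets/TileByteCounts, a Zarr chunk key, or a JPEG2000 TLM table):

```python
def tile_range_header(tile_offsets, tile_lengths, tile_index):
    """Build the HTTP Range header that fetches a single tile,
    given precomputed per-tile byte offsets and lengths."""
    start = tile_offsets[tile_index]
    end = start + tile_lengths[tile_index] - 1  # HTTP ranges are inclusive
    return f"bytes={start}-{end}"

# e.g. the third tile starts at byte 8192 and is 4096 bytes long
print(tile_range_header([0, 4096, 8192], [4096, 4096, 4096], 2))
```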

Point 1 is easy to verify: using gdalinfo, one can check that the bands have Block=NxM specified (e.g. Block=1024x1024 for S2 rasters).

Point 2 is trickier and requires more knowledge of the data format.
TIFF uses the TileOffsets and TileByteCounts tags, and Zarr encodes the chunk id in the filename. For JPEG2000, there is an optional TLM (tile-part lengths) marker segment in the main header for this purpose.
Until recently, OpenJPEG did not use TLM markers to optimize decoding, but this was fixed in the 2.5.3 release.
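For illustration, here is a simplified scan of a J2K main header for TLM segments. This is a sketch under my reading of the marker-segment layout (2-byte marker, 2-byte length, then Ztlm/Stlm bytes and the length entries); a real parser, e.g. inside OpenJPEG, handles more cases:

```python
import struct

TLM = 0xFF55  # tile-part lengths marker (main header only)
SOT = 0xFF90  # start of tile-part: ends the main header

def read_tlm(stream: bytes):
    """Return the tile-part lengths declared by TLM segments.
    Simplified sketch: assumes `stream` starts right after SOC."""
    lengths, pos = [], 0
    while pos + 4 <= len(stream):
        marker, seg_len = struct.unpack_from(">HH", stream, pos)
        if marker == SOT:
            break
        if marker == TLM:
            stlm = stream[pos + 5]            # Stlm byte, after Ztlm
            st = (stlm >> 4) & 0b11           # Ttlm size: 0, 1 or 2 bytes
            sp = (stlm >> 6) & 0b1            # Ptlm size: 2 or 4 bytes
            entry = st + (4 if sp else 2)
            p, end = pos + 6, pos + 2 + seg_len
            while p < end:
                fmt = ">I" if sp else ">H"
                lengths.append(struct.unpack_from(fmt, stream, p + st)[0])
                p += entry
        pos += 2 + seg_len                    # seg_len excludes the marker
    return lengths
```

With such a table, each tile maps to one byte range in the codestream.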

So why is remote partial reading of Sentinel-2 rasters so slow? Because the TLM option was never enabled!
We have performed simple benchmarks (see GitHub - Kayrros/sentinel-2-jp2-tlm), and they show that enabling the option for future products would make Sentinel-2 imagery “cloud native”, with performance similar to COG or Zarr, while requiring only a minor change to the format.

What about the archive data, which is all online on CDSE S3 but inefficient to access because of the lack of TLMs?
We developed a “TLM indexer” to pre-compute the TLM tables for all historical data (partially public in the GitHub repository above), and, following a suggestion by Even Rouault to use GDAL’s /vsisparse/, we can inject them on the fly when doing a partial read (the Python package “jp2io”, in development in the same repository).
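To give a flavour of the /vsisparse/ trick: GDAL's sparse-file driver reads an XML description that stitches byte regions from several sources into one virtual file. The element names below follow GDAL's documented /vsisparse/ format as I understand it, but the filenames, offsets and lengths are purely illustrative, not taken from jp2io:

```xml
<!-- Illustrative only: splice a precomputed TLM segment into the header
     region of a remote JP2, leaving the codestream bytes untouched. -->
<VSISparseFile>
  <SubfileRegion>  <!-- JP2 header up to the TLM insertion point -->
    <Filename>/vsis3/eodata/Sentinel-2/example/B04.jp2</Filename>
    <DestinationOffset>0</DestinationOffset>
    <SourceOffset>0</SourceOffset>
    <RegionLength>1024</RegionLength>
  </SubfileRegion>
  <SubfileRegion>  <!-- precomputed TLM segment injected on the fly -->
    <Filename>/local/index/tlm_segment.bin</Filename>
    <DestinationOffset>1024</DestinationOffset>
    <SourceOffset>0</SourceOffset>
    <RegionLength>58</RegionLength>
  </SubfileRegion>
  <SubfileRegion>  <!-- remainder of the original codestream -->
    <Filename>/vsis3/eodata/Sentinel-2/example/B04.jp2</Filename>
    <DestinationOffset>1082</DestinationOffset>
    <SourceOffset>1024</SourceOffset>
    <RegionLength>100000000</RegionLength>
  </SubfileRegion>
</VSISparseFile>
```

GDAL then sees a file that appears to contain TLM, while only the requested tile ranges are ever fetched from S3.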

In short: highly efficient partial access to Sentinel-2 rasters is possible without converting the collections to COG or Zarr!

We are currently working on indexing the full archive and plan to make the indexes available publicly somehow, and we are also trying to get ESA to enable the TLM option for future products.

What do you think of this approach?

Also, I’m not too familiar with Kerchunk/VirtualiZarr, but I believe it should be possible to make a virtual datacube of Sentinel-2, exploiting the data in JPEG2000 on CDSE with no modification + the TLM indexes. I’d love to get some feedback on this.


This sounds like potentially an ideal use case for the “virtual zarr” approach.

A weak definition of “cloud optimized”

Yes, this is a good definition, which fits well with the article I wrote about the topic last week :slight_smile:

We have performed simple benchmarks (see GitHub - Kayrros/sentinel-2-jp2-tlm), and they show that enabling the option for future products would make Sentinel-2 imagery “cloud native”, with performance similar to COG or Zarr, while requiring only a minor change to the format.

Awesome! That’s very similar in spirit to the “cloud-optimized HDF” work.

I believe it should be possible to make a virtual datacube of Sentinel-2, exploiting the data in JPEG2000 on CDSE with no modification + the TLM indexes. I’d love to get some feedback on this.

Yes. If the individual files are already “cloud-optimized” at rest, the advantages of creating a virtual zarr store pointing at the data are:

  • downstream applications can access the data through the general-purpose zarr API (and therefore through xarray.open_zarr)
  • the entire dataset can be addressed as a single massive datacube, rather than users having to deal with large numbers of individual filepaths.

I would love to help you with this. The two things to understand first are:

  1. Are there any other properties of the Sentinel-2 data which would make it hard to map to Zarr (see FAQ — VirtualiZarr)?
  2. Has anyone written a VirtualiZarr/Kerchunk reader for JPEG2000 yet? If not then you might want to look at the VirtualiZarr reader for HuggingFace’s SafeTensors format as an example.

I think it should be fine. Sentinel-2 has bands at different resolutions, but in the worst case one could create one virtual zarr per band and it would still be convenient.
Also, Sentinel-2 rasters are produced on a grid system (modified MGRS), so that all products of the same ‘MGRS’ tile already share a lot of attributes (CRS, transform, shape for example): it is already quite “cube oriented” in that sense.

Thank you for the safetensors example, that’s a simple format to understand how to integrate with VirtualiZarr.

Apparently there have been experiments with JPEG2000 codecs for Zarr (GitHub - glencoesoftware/zarr-jpeg2k: Zarr JPEG-2000 codec, but also directly in imagecodecs: GitHub - cgohlke/imagecodecs: Image transformation, compression, and decompression codecs).
I’m not sure yet whether these codecs assume a complete jp2 stream per chunk (header + encoded data) or just encoded data + external codec settings. Because the ‘virtual chunk’ would directly reference existing encoded data in JPEG2000 files, we need the second option. In any case, I see no real blockers here; in the worst case we can bind directly to the relevant OpenJPEG internal decoding functions, for example.


Hi @j.anger, this is a game-changer idea!

It solves all interoperability issues and enables cloud-optimized access to the existing Sentinel-2 SAFE archive (what people/software already know). Many public and private buckets with Sentinel-2 L1C and L2A data are in SAFE, and being able to do partial reads would be a dream. Plus, it would also be the cheapest solution by far, with no reprocessing/changes required.

GDAL recently implemented kerchunk references. I haven’t explored it yet, but it might be worth a look.


Thanks for the feedback!

Quick update on the zarr-virtualization of Sentinel-2 imagery:

I’ve developed a small codec that concatenates a J2K codestream header with the fetched chunk and uses imagecodecs.jpeg2k_decode to decode the tile.
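The idea can be sketched as follows. This is a minimal, hypothetical version, not the actual jp2io code: the class name, the injectable decode_fn, and the appended EOC (end-of-codestream) marker are my assumptions; the real codec works with imagecodecs.jpeg2k_decode and handles more details.

```python
class JP2ChunkCodec:
    """Sketch: each virtual chunk references a raw tile-part located via a
    TLM index, so we rebuild a minimal standalone codestream around it
    before handing it to a JPEG2000 decoder."""

    EOC = b"\xff\xd9"  # end-of-codestream marker (assumed to be needed)

    def __init__(self, header: bytes, decode_fn=None):
        # header: main-header bytes captured once per source raster
        # decode_fn: e.g. imagecodecs.jpeg2k_decode in the real code
        self.header = header
        self.decode_fn = decode_fn

    def decode(self, chunk: bytes):
        codestream = self.header + chunk + self.EOC
        if self.decode_fn is None:
            from imagecodecs import jpeg2k_decode as decode_fn
        else:
            decode_fn = self.decode_fn
        return decode_fn(codestream)
```

The decoder only ever sees a well-formed single-tile codestream, while the bytes in the middle come straight from a Range request on the original JP2.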

Then, using the TLM indexes that we have plus VirtualiZarr, one can produce a virtual dataset and concatenate bands and dates into a large cube (e.g. below, 1353 dates x 3 bands = 2TB, still virtual). It can then be exported to kerchunk for later use. The virtual dataset references JPEG2000 rasters on CDSE S3.
Of course using the cube requires the custom codec, so the generated kerchunk file probably won’t be compatible with GDAL for a while (short of a custom GDAL driver), but the goal here is mostly to show that reprocessing the full archive to Zarr is not required.

<xarray.Dataset> Size: 2TB
Dimensions:  (time: 1353, y: 10980, x: 10980)
Coordinates:
  * time     (time) datetime64[ns] 11kB 2015-07-06T10:50:16 ... 2024-12-30T10...
Dimensions without coordinates: y, x
Data variables:
    B02      (time, y, x) float32 652GB ...
    B03      (time, y, x) float32 652GB ...
    B04      (time, y, x) float32 652GB ...

Right now I’ve only put in three bands and no attributes, but it wouldn’t be too difficult to extend that to be comparable to the EOPF format.
It is already quite convenient: loading one chunk requires only one request, and since all chunks of the same date/band are in the same file, a single request could fetch them all too (you can’t scale much better than that).


This is amazing!! Can you share the code you used with VirtualiZarr?

You can find it here sentinel-2-jp2-tlm/jp2io at main · Kayrros/sentinel-2-jp2-tlm · GitHub (module jp2io.zarr, and folder zarr-demo).
I was unfamiliar with xarray, zarr v2, zarr v3 and VirtualiZarr, and as you can see I didn’t really follow the proper way to add a new backend to VirtualiZarr. This is partly because, compared to the other backends in VirtualiZarr, I already have a parquet file that gives me the tile offsets for many rasters (the TLM parquet index), not just one S2 product, so I wasn’t sure how to fit things into the recommended interface. This would need rework.

There are also many TODOs that should be fixed to move to a more future-proof codec for example.

In terms of performance, I’ve put up a notebook to play with a large cube, but “it” (dask? xarray? kerchunk?) seems to struggle with the size of the kerchunk file (~150MB of JSON).
Also keep in mind that CDSE S3 for general users is limited to 4 parallel connections; I’m not sure how the demo handles that currently (I’m not using a general user account).

Having done this experiment, a nice evolution of the Sentinel-2 collection might be to have a kerchunk.json-like file in each SAFE product (and referenced from the STAC), and the user could concatenate them through time or through space as they wish. It would be quite easy to add such a small file to the existing 50PB collection.

Sorry - we’re just finishing a significant refactor of all of this, so the proper way is not yet clearly documented. However it looks like you created a ManifestStore directly, which is the recommended way going forward.

In terms of performance, I’ve put up a notebook to play with a large cube, but “it” (dask? xarray? kerchunk?) seems to struggle with the size of the kerchunk file (~150MB of JSON).

It’s almost certainly the Kerchunk json format that’s the bottleneck here. JSON is simply an extremely inefficient format for storing chunk references. Using VirtualiZarr and saving to Icechunk instead (no kerchunk involved) I can comfortably write references for millions of chunks. (The Kerchunk Parquet format would also probably work better.)

have a kerchunk.json-like file in each SAFE product (and referenced from the STAC), and the user could concatenate them through time or through space as they wish.

Interesting. I think we should discuss alternative ways to provide a similar user experience.

50PB collection

With 50PB of data you can’t afford to use an inefficient format to store the chunk references! How many individual chunks are there in those 50PB?

Do you know if Icechunk supports providing a custom codec? Or does it delegate the bytes decoding/decompression to another library?

Indeed Kerchunk Parquet is much better at this scale, it’s now much faster!

But no one will try to virtualize the full collection as a single array (in particular because it spans multiple CRSs), right? So the references can be distributed in parts (per product, per MGRS tile, etc.).

Considering only the four 10-meter bands, with one product = 484 chunks and around 50M products for a given product level, that’s around 24B chunks. A good proportion are empty (products on the border of the swath). The other bands have somewhat fewer chunks.
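The 484-chunks-per-product figure follows from numbers already in this thread (10980x10980-pixel products, 1024x1024 tiles, four 10 m bands); a quick sanity check:

```python
import math

tiles_per_side = math.ceil(10980 / 1024)        # 11 tile rows/columns
chunks_per_band = tiles_per_side ** 2           # 121 chunks per band
chunks_per_product = 4 * chunks_per_band        # the four 10 m bands
total_chunks = chunks_per_product * 50_000_000  # ~50M products
print(chunks_per_product, total_chunks)         # 484, 24.2 billion
```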
We are currently indexing the L1C/L2A collections; the chunks for all bands of all products of a given MGRS tile take around 5MB in a parquet file (and I’m storing the path to the rasters in addition to the TLM). Overall it will be a few hundred GB, which is small enough.

In this sense Icechunk is just Zarr. So yes it supports providing a custom codec. The bytes decoding/decompression is done by the Zarr reader, for example Zarr-Python, so works exactly the same way as for non-Icechunk Zarr.

Indeed Kerchunk Parquet is much better at this scale, it’s now much faster!

It’s not hard to do better than JSON for this. I’m very curious how this performs with Icechunk too.

But no one will try to virtualize the full collection as a single array (in particular because it spans multiple CRSs), right? So the references can be distributed in parts (per product, per MGRS tile, etc.).

Maybe not as a single array, but you might want to try to distribute the data as one Zarr store. That means the whole dataset is one URL, and allows analysis code to work with a complete store.

that’s around 24B chunks.

That’s a lot of chunks, but potentially supportable. How small are these chunks though?

Thanks, I thought it reimplemented its own codec pipeline in Rust, but now I better understand where it positions itself. Indeed it works!

From a very small test, I see a 2x improvement over kerchunk parquet.

I see. I didn’t have that in mind, but indeed it might be the most convenient way to distribute the chunks to Zarr.

From my understanding, ESA (through EOPF) is currently planning to change the distribution format to Zarr, and has not yet decided how to deal with archive data (though the goal is to make it available as Zarr). I guess an Icechunk store would fill this gap and provide a transparent “export” of existing JPEG2000s as Zarr, without reprocessing the data.
In that sense, Zarr should not be thought of as a file format, but as a protocol.

However, this also allows for something that would be a lot less disruptive for the community: keep SAFE as the storage format, and maintain an Icechunk store for both archive and live products.

Here’s a histogram of chunk sizes on a subset: