Cloud-optimized access to Sentinel-2 JPEG2000

Dear Pangeo community,

As you may know, Sentinel-2 rasters are distributed as JPEG2000.
If you try to crop an S2 raster remotely, you’ll find that many requests are issued, so the format is quickly classified as “not cloud native”, in contrast to COG, Zarr, etc.
This might be why the format was not even mentioned in the recent discussion about raster file formats.

A weak definition of “cloud optimized” for imagery rasters could be:

  1. individual tile access, to allow partial read,
  2. easy localization of any tile, to issue a single range request to retrieve it.

Point 1 is easy to verify: using gdalinfo, one can check that the bands have Block=NxM specified (e.g. Block=1024x1024 for S2 rasters).
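The same check can be done from Python, for instance with rasterio (equivalent to reading gdalinfo’s Block=NxM output); the filename below is just a placeholder:

```python
# Check the internal tiling of a Sentinel-2 band with rasterio.
# The path is a placeholder; any local or /vsis3/ JP2 works.
import rasterio

with rasterio.open("T31TCJ_20240101T105031_B04_10m.jp2") as src:
    print(src.block_shapes)  # e.g. [(1024, 1024)] for Sentinel-2 bands
```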

Point 2 is trickier, and requires more knowledge of the data format.
TIFF uses the TileOffsets and TileByteCounts tags, and Zarr encodes the chunk id in the filename. For JPEG2000, there is an optional TLM (tile-part lengths) marker segment in the main header for this purpose.
Until recently, TLM markers were not used by OpenJPEG to optimize decoding, but this has been fixed in the 2.5.3 release.
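For illustration, here is a minimal sketch of how one can check whether a given product already carries a TLM segment, by walking the marker segments of the codestream’s main header (TLM has marker code 0xFF55). It assumes the main header fits in the first chunk of bytes read; the path is a placeholder:

```python
# Minimal sketch: does the codestream's main header contain a TLM segment?
# We locate SOC+SIZ, then walk marker segments until the first SOT.
import struct

def main_header_has_tlm(data: bytes) -> bool:
    pos = data.find(b"\xff\x4f\xff\x51")  # SOC marker immediately followed by SIZ
    if pos < 0:
        return False                       # no codestream found in this byte range
    pos += 2                               # skip SOC (it has no segment body)
    while pos + 4 <= len(data):
        marker, seg_len = struct.unpack(">HH", data[pos:pos + 4])
        if marker == 0xFF55:               # TLM: tile-part lengths
            return True
        if marker == 0xFF90:               # SOT: main header is over
            return False
        pos += 2 + seg_len                 # 2-byte marker + segment (length includes itself)
    return False

with open("T31TCJ_20240101T105031_B04_10m.jp2", "rb") as f:   # placeholder path
    print(main_header_has_tlm(f.read(64 * 1024)))             # assumes the main header fits in 64 KiB
```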

So why is remote partial reading of Sentinel-2 rasters so slow? Because the TLM option was never enabled!
We performed simple benchmarks (see GitHub - Kayrros/sentinel-2-jp2-tlm), and they show that enabling the option for future products would make Sentinel-2 imagery “cloud native”, with performance similar to COG or Zarr, while requiring only a minor change to the format.
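As an aside, if I read the GDAL documentation correctly, a recent GDAL (>= 3.4) built against OpenJPEG >= 2.5 already exposes a TLM creation option in the JP2OpenJPEG driver, so producing TLM-enabled JP2s does not require exotic tooling. This is not how the Sentinel-2 production chain is set up; it is just to illustrate that the change is a single encoder flag. Treat the exact options below as a sketch to verify against your install:

```python
# Sketch: re-encode a JP2 with a TLM marker segment via GDAL's JP2OpenJPEG
# driver. Requires GDAL >= 3.4 with OpenJPEG >= 2.5 (check with
# `gdalinfo --format JP2OpenJPEG`); paths are placeholders.
from osgeo import gdal

gdal.Translate(
    "B04_10m_with_tlm.jp2",
    "B04_10m.jp2",
    format="JP2OpenJPEG",
    creationOptions=[
        "QUALITY=100",
        "REVERSIBLE=YES",     # lossless, like the original products
        "BLOCKXSIZE=1024",    # keep the 1024x1024 tiling
        "BLOCKYSIZE=1024",
        "TLM=YES",            # write the TLM table in the main header
    ],
)
```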

What about the archive data, which is all online on CDSE S3 but inefficient to access because of the lack of TLMs?
We developed a “TLM indexer” to pre-compute the TLM tables for all historical data (partially public in the GitHub repository above), and, following a suggestion by Even Rouault to use GDAL’s /vsisparse/, we can inject them on the fly when doing a partial read (Python package “jp2io”, in development in the same GitHub repository).
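To make the idea concrete, here is a rough sketch of the /vsisparse/ trick; it is not the actual jp2io implementation. A small XML “sparse file” stitches a locally rebuilt main header (carrying the pre-computed TLM table) together with byte ranges read directly from the remote JP2. All paths, offsets and sizes are placeholders, and for simplicity the sketch assumes the rebuilt header occupies exactly the same number of bytes as the one it replaces, so tile-part offsets stay valid:

```python
# Rough sketch of injecting a pre-computed TLM table through GDAL's /vsisparse/.
from osgeo import gdal

remote_jp2 = "/vsis3/eodata/Sentinel-2/.../T31TCJ_B04_10m.jp2"  # placeholder
patched_header = "/tmp/B04_header_with_tlm.bin"                 # main header rebuilt with the TLM segment
header_len = 10_000          # bytes replaced at the start of the file (placeholder)
remote_size = 98_765_432     # total size of the remote JP2 (placeholder)

sparse_xml = f"""<VSISparseFile>
  <Length>{remote_size}</Length>
  <SubfileRegion>
    <Filename>{patched_header}</Filename>
    <DestinationOffset>0</DestinationOffset>
    <SourceOffset>0</SourceOffset>
    <RegionLength>{header_len}</RegionLength>
  </SubfileRegion>
  <SubfileRegion>
    <Filename>{remote_jp2}</Filename>
    <DestinationOffset>{header_len}</DestinationOffset>
    <SourceOffset>{header_len}</SourceOffset>
    <RegionLength>{remote_size - header_len}</RegionLength>
  </SubfileRegion>
</VSISparseFile>"""

with open("/tmp/B04_sparse.xml", "w") as f:
    f.write(sparse_xml)

# Partial reads through this dataset now see a TLM-enabled codestream.
ds = gdal.Open("/vsisparse//tmp/B04_sparse.xml")
```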

In short: highly efficient partial access to Sentinel-2 rasters is possible without converting the collections to COG or Zarr!

We are currently working on indexing the full archive and plan to make the indexes available publicly somehow, and we are also trying to get ESA to enable the TLM option for future products.

What do you think of this approach?

Also, I’m not too familiar with Kerchunk/VirtualiZarr, but I believe it should be possible to build a virtual datacube of Sentinel-2, exploiting the unmodified JPEG2000 data on CDSE plus the TLM indexes. I’d love to get some feedback on this.


This sounds like potentially an ideal use case for the “virtual zarr” approach.

A weak definition of “cloud optimized”

Yes, this is a good definition, which fits well with the article I wrote about the topic last week 🙂

We performed simple benchmarks (see GitHub - Kayrros/sentinel-2-jp2-tlm), and they show that enabling the option for future products would make Sentinel-2 imagery “cloud native”, with performance similar to COG or Zarr, while requiring only a minor change to the format.

Awesome! That’s very similar in spirit to the “cloud-optimized HDF” work.

I believe it should be possible to build a virtual datacube of Sentinel-2, exploiting the unmodified JPEG2000 data on CDSE plus the TLM indexes. I’d love to get some feedback on this.

Yes. If the individual files are already “cloud-optimized” at rest, the advantages of creating a virtual zarr store pointing at the data are:

  • downstream applications can access the data through the general-purpose zarr API (and therefore through xarray.open_zarr)
  • the entire dataset can be addressed as a single massive datacube, rather than users having to deal with large numbers of individual filepaths (see the sketch after this list for what such chunk references could look like).
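To sketch how the TLM indexes could feed such a store, here is what a Kerchunk-style reference set could look like: each Zarr chunk key maps to the byte range of one JPEG2000 tile-part inside the original .jp2 on CDSE. The offsets, lengths, URL, and the “jpeg2000_tile” codec id below are all made up for illustration; that codec does not exist yet.

```python
# Hypothetical Kerchunk-style references driven by the TLM index.
import json

refs = {
    "version": 1,
    "refs": {
        "B04/.zarray": json.dumps({
            "shape": [10980, 10980],
            "chunks": [1024, 1024],
            "dtype": "<u2",
            "compressor": {"id": "jpeg2000_tile"},  # hypothetical codec id
            "fill_value": 0,
            "filters": None,
            "order": "C",
            "zarr_format": 2,
        }),
        # chunk (row 0, col 0) -> [url, offset, length], taken from the TLM table
        "B04/0.0": [
            "s3://eodata/Sentinel-2/MSI/L2A/.../T31TCJ_B04_10m.jp2",  # placeholder
            71_204,    # byte offset of the corresponding tile-part (made up)
            512_331,   # tile-part length (made up)
        ],
    },
}
```

In principle such a reference set could then be opened through fsspec’s reference filesystem and xarray.open_zarr, once a codec for the chunks exists.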

I would love to help you with this. The two things to understand first are:

  1. Are there any other properties of the Sentinel-2 data which would make it hard to map to Zarr (see FAQ — VirtualiZarr)?
  2. Has anyone written a VirtualiZarr/Kerchunk reader for JPEG2000 yet? If not, then you might want to look at the VirtualiZarr reader for HuggingFace’s SafeTensors format as an example.

I think it should be fine. Sentinel-2 has bands at different resolutions, but in the worst case one could create one virtual zarr per band and it would still be convenient.
Also, Sentinel-2 rasters are produced on a grid system (modified MGRS), so that all products of the same ‘MGRS’ tile already share a lot of attributes (CRS, transform, shape for example): it is already quite “cube oriented” in that sense.

Thank you for the safetensors example; it’s a simple format, which makes it a good starting point for understanding how to integrate with VirtualiZarr.

Apparently there have already been experiments with JPEG2000 codecs for Zarr (GitHub - glencoesoftware/zarr-jpeg2k: Zarr JPEG-2000 codec, and also directly in imagecodecs: GitHub - cgohlke/imagecodecs: Image transformation, compression, and decompression codecs).
I’m not sure yet whether these codecs assume a complete jp2 stream per chunk (header + encoded data) or just encoded data plus external codec settings. Because the ‘virtual chunk’ would directly reference existing encoded data in JPEG2000 files, we need the second option. In any case, I see no real blockers here; in the worst case we could bind directly to the relevant OpenJPEG internal decoding functions, for example.
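To make the second option concrete, here is a speculative sketch of what a decode-only numcodecs codec for these virtual chunks could look like. The main header is passed in as “external codec settings”, and each chunk is a raw tile-part that gets wrapped back into a standalone codestream before being handed to a regular JPEG2000 decoder (imagecodecs here). Whether such naive re-wrapping is always valid is exactly the open question above, so treat this purely as a sketch.

```python
from numcodecs.abc import Codec


class JP2TileCodec(Codec):
    """Decode-only codec: each Zarr chunk is one raw JPEG 2000 tile-part."""

    codec_id = "jpeg2000_tile"  # placeholder id, not registered anywhere yet

    def __init__(self, main_header):
        # External codec settings: the main header (SIZ/COD/QCD...) shared by
        # every tile of the product, stored once instead of once per chunk.
        self.main_header = bytes(main_header)

    def decode(self, buf, out=None):
        import imagecodecs
        # Naively rebuild a standalone codestream: main header + tile-part + EOC.
        codestream = self.main_header + bytes(buf) + b"\xff\xd9"
        return imagecodecs.jpeg2k_decode(codestream)

    def encode(self, buf):
        raise NotImplementedError("virtual chunks are read-only")
```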


Hi @j.anger, this is a game-changing idea!

It solves all interoperability issues and enables cloud-optimized access to the existing Sentinel-2 SAFE archive (what people/software already know). Many public and private buckets with Sentinel-2 L1C and L2A data are in SAFE, and being able to do partial reads would be a dream. Plus, it would also be the cheapest solution by far, with no reprocessing/changes required.

GDAL recently implemented support for Kerchunk references. I haven’t explored it yet, but it might be interesting to take a look.
