The L3 altimetry data have been updated on the “pangeo-cnes” bucket. In this update, I modified the encoding settings so that the dataset can be accessed more quickly. I used Zarr filters and other encoding options to significantly improve compression and read performance.
Since version 2 of Zarr, it is possible to apply filters to encode the data before writing it. The idea is to transform the data in order to improve compression.
In the case of altimetry data, we have a time axis, encoded as 64-bit integers, representing the date of each measurement to the nearest microsecond.
If the zarr.Delta filter is applied, the data are transformed to store only the difference between successive items, which reduces the entropy of the binary representation. For example, an array containing 45723 different dates for one day contains only 133 different values after the filter is applied. This makes compression much more efficient. For the Topex mission (where the time axis represents nearly 10 years of data), this filter gives a compression factor of 62.6, versus 5 without it. In other words, to read a time axis of 1.2 GB we only need to read 16.5 MB.
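To make the effect of delta encoding concrete, here is a minimal pure-Python sketch of the transformation. It is not Zarr's implementation: the helper names `delta_encode` and `delta_decode`, the synthetic time axis, and the use of zlib as the compressor are all assumptions made for illustration.

```python
import random
import zlib
from array import array


def delta_encode(values):
    """Keep the first value, then store each difference to the previous item."""
    out = array("q", values[:1])
    out.extend(b - a for a, b in zip(values, values[1:]))
    return out


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out = array("q")
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Synthetic time axis (assumption): roughly 20 samples per second with a
# little jitter, stored as signed 64-bit integers in microseconds.
random.seed(0)
t = 1_000_000_000_000_000
times = array("q")
for _ in range(50_000):
    times.append(t)
    t += 50_000 + random.choice((-1, 0, 1))

deltas = delta_encode(times)
assert delta_decode(deltas) == times  # the filter is lossless

# Compare how well a general-purpose compressor does on each representation.
raw_size = len(zlib.compress(times.tobytes()))
delta_size = len(zlib.compress(deltas.tobytes()))
print(len(set(times)), "distinct timestamps ->", len(set(deltas)), "distinct deltas")
print("zlib bytes without delta:", raw_size, "with delta:", delta_size)
```

On this synthetic axis the deltas take only a handful of distinct values, which is what lets the compressor do so much better. The actual ratios depend on the real sampling and on the compressor used.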
It is also possible to chain several filters. For example, for the other variables, I used the zarr.FixedScaleOffset and zarr.Delta filters. The first filter packs the data using a scale factor and an offset, as is done in the CF conventions and by Xarray. The advantage of using this filter is that the data are decoded natively when read with Dask or Zarr. These filters raise the compression factor from 1.3 to 3.2. This is less impressive than for the time variable, but it reduces the storage space from 1.9 GB to 783 MB for the entire Topex mission dataset.
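The chaining idea can be sketched in plain Python as follows. This imitates what a scale/offset filter followed by a delta filter does; it does not call Zarr. The scale of 1000, the offset of 0, the synthetic signal, and the helper names are assumptions chosen for illustration, not the settings used for the dataset.

```python
import math
import zlib
from array import array

SCALE = 1000   # assumption: keep three decimal places
OFFSET = 0.0   # assumption


def scale_offset_encode(values, scale=SCALE, offset=OFFSET):
    """Pack floats as C ints: round((x - offset) * scale). This step is lossy."""
    return array("i", (round((x - offset) * scale) for x in values))


def scale_offset_decode(packed, scale=SCALE, offset=OFFSET):
    """Unpack the integers back to floats: q / scale + offset."""
    return array("d", (q / scale + offset for q in packed))


def delta_encode(values):
    """Keep the first value, then store each difference to the previous item."""
    out = array(values.typecode, values[:1])
    out.extend(b - a for a, b in zip(values, values[1:]))
    return out


# Synthetic smooth signal (assumption), standing in for a measured variable.
signal = array(
    "d",
    (0.5 * math.sin(i / 200.0) + 0.05 * math.sin(i / 7.0) for i in range(50_000)),
)

packed = scale_offset_encode(signal)   # first filter
chained = delta_encode(packed)         # second filter, applied to the packed integers
restored = scale_offset_decode(packed)

# Largest error introduced by the packing step (at most half a quantisation step).
max_error = max(abs(a - b) for a, b in zip(signal, restored))

sizes = {
    "float64": len(zlib.compress(signal.tobytes())),
    "scale/offset": len(zlib.compress(packed.tobytes())),
    "scale/offset + delta": len(zlib.compress(chained.tobytes())),
}
print("max error:", max_error)
for name, size in sizes.items():
    print(name, size, "bytes after zlib")
```

Unlike the delta step, the scale/offset step discards precision beyond the chosen scale, so the scale has to match the precision the variable actually needs.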
In short, it is very useful to play with these filters to compress the data more efficiently.