Assessment tools for lossy compression of geoscientific data

Scientific Motivation

With data volumes from both observations and numerical models increasing dramatically, compression is increasingly necessary. Compression was once rarely used at all, but lossless methods have since become the norm. Lossless compression, however, has its limits. Lossy compression offers much greater storage savings, but applying too much can introduce undesirable artifacts into the data. Recent work has shown that the amount and type of lossy compression that is appropriate for geoscientific data depends on the kind of data and how it will be used. A wealth of statistical analyses can inform users how much lossy compression their data can tolerate, and in recent years useful measures have been investigated and implemented in a variety of packages.

Proposed Hacking

We plan to take existing research code that performs statistical analyses comparing original data to its values after lossy compression and decompression, scope out which parts can be used immediately in Pangeo-like workflows, and package that code in a reusable way.
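For example, a minimal round-trip check might look something like the sketch below. It uses numcodecs' Quantize codec purely as a stand-in for whatever lossy filter is actually applied; the variable names and parameter choices are just for illustration.

```python
import numpy as np
from numcodecs import Quantize

# a 1-D float64 test field standing in for real model output
data = np.random.default_rng(0).random(100_000)

# keep roughly 3 significant decimal digits (a lossy filter)
codec = Quantize(digits=3, dtype="float64")

# encode applies the quantization; for this codec, decode simply
# returns the buffer, since the information loss is not reversible
restored = np.asarray(codec.decode(codec.encode(data)))

print("max abs error:", np.max(np.abs(data - restored)))
```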

Anticipated Data Needs

We will need access to test data that exists on GLADE at NWSC.

Anticipated Software Tools

We will work to integrate the existing research code into the Pangeo ecosystem, primarily focusing on interoperability with xarray data structures.
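As a rough sketch of what the xarray-facing layer could look like (the function and metric names here are placeholders, not the actual research code):

```python
import numpy as np
import xarray as xr

def compare_fields(original: xr.DataArray, restored: xr.DataArray) -> dict:
    """Simple error statistics between an original field and its
    lossy round-tripped counterpart (assumes matching coordinates)."""
    diff = original - restored
    return {
        "rmse": float(np.sqrt((diff ** 2).mean())),
        "max_abs_err": float(np.abs(diff).max()),
        "pearson_corr": float(xr.corr(original, restored)),
    }
```

Because the reductions are plain xarray operations, this should also work on dask-backed arrays, with the final float() calls forcing computation.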

Desired Collaborators

We are looking for xarray users and experts, as well as people with expertise in compression, data formats (notably netCDF and Zarr), and statistics.


I like this project idea, and would love to join in on this one!

@kmpaul, is the existing research code publicly available?

I think it should be, but we need to ask Allison for more info. It might be scattered about, too.

And, obviously, welcome aboard!


Great project!
What kind of compression are we talking about? Is it just reducing the precision of floating-point data (like in https://zarr.readthedocs.io/en/v2.1.2/api/codecs.html?highlight=string#zarr.codecs.Quantize), or something similar to JPEG for geoscientific data?

All thanks forwarded to Allison. :smiley:

Our thought is to implement statistical testing metrics that tell you whether the lossy compression filter you might apply to the data will introduce statistical artifacts. So, in that sense, it is compressor agnostic.

However, a lot of our work in lossy compression is more sophisticated than just Quantize. Our recent work is in collaboration with LLNL on Peter Lindstrom’s zfp compressor, which is specially designed for N-dimensional floating-point arrays.
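For anyone who wants to try zfp from Python, the zfpy bindings expose a simple round-trip. Here is a sketch in fixed-accuracy mode (the tolerance value is arbitrary, and I haven't benchmarked this):

```python
import numpy as np
import zfpy  # Python bindings for LLNL's zfp

data = np.random.default_rng(0).random((64, 64, 64))

# fixed-accuracy mode: absolute error is bounded by the tolerance
compressed = zfpy.compress_numpy(data, tolerance=1e-3)
restored = zfpy.decompress_numpy(compressed)

print("compression ratio:", data.nbytes / len(compressed))
print("max abs error:", np.max(np.abs(data - restored)))
```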

There is something similar to Quantize for netCDF files, too. Can’t remember how to use it, though.

There is also Limited Error Raster Compression: https://github.com/Esri/lerc
It would be good to have it available in Zarr, by the way.

ZFP is definitely the way to go here!

Indeed, it seems quite nice. Reading https://github.com/zarr-developers/numcodecs/issues/117, it looks like it is not usable in Zarr yet?

No. I don’t think so. But there is work to make it possible. I need to check in on that work at some point.

I set up a GitHub repo for our work (ldc = lossy data compression):