Assessment tools for lossy compression of geoscientific data
Scientific Motivation
With data volumes from both observations and numerical models increasing dramatically, compression is increasingly necessary. Historically, geoscientific data were often stored uncompressed; lossless compression has since become the norm. However, lossless methods have their limits. Lossy compression offers the potential for much greater storage savings, but too much of it can introduce undesirable artifacts into the data. Recent work has shown that the amount and type of lossy compression that is optimal for geoscientific data depends on the kind of data and how it will be used. A wealth of statistical analyses can inform users how much lossy compression their data can tolerate, and in recent years useful measures have been investigated and implemented in a variety of packages.
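To make the idea concrete, here is a minimal sketch of the kind of before/after comparison such analyses start from. This is illustrative only: the "compression" is simulated by decimal rounding, a stand-in for a real compress-decompress round trip, and the field is synthetic.

```python
import numpy as np

# Synthetic temperature-like field standing in for real geoscientific data.
rng = np.random.default_rng(0)
original = rng.normal(loc=280.0, scale=5.0, size=(64, 64))

# Simulate a lossy round trip by rounding to 2 decimal digits.
digits = 2
scale = 10.0 ** digits
reconstructed = np.round(original * scale) / scale

# Two simple error measures comparing original and reconstructed values.
rmse = np.sqrt(np.mean((original - reconstructed) ** 2))
max_abs_err = np.max(np.abs(original - reconstructed))
print(f"RMSE: {rmse:.6f}, max abs error: {max_abs_err:.6f}")
```

Real assessment tools go well beyond pointwise errors (e.g., spatial and spectral measures), but every such metric reduces to comparing the two arrays above.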
Proposed Hacking
We plan to take existing research code that performs statistical analyses comparing original data to its compressed-then-decompressed values, scope out what can be used immediately in Pangeo-style workflows, and package that code in a reusable way.
Anticipated Data Needs
We will need access to test data that resides on the GLADE file system at NWSC.
Anticipated Software Tools
We will work to integrate the existing research code into the Pangeo ecosystem, primarily focusing on interoperability with xarray data structures.
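As a rough sketch of what xarray interoperability could look like, the snippet below defines a hypothetical helper (the name `error_summary` and its exact statistics are assumptions, not the research code's actual API) that takes an original and a reconstructed DataArray and returns a few error statistics.

```python
import numpy as np
import xarray as xr

def error_summary(orig: xr.DataArray, recon: xr.DataArray) -> dict:
    """Illustrative only: pointwise error statistics between an original
    field and its compressed-then-decompressed counterpart."""
    diff = orig - recon
    return {
        "mean_error": float(diff.mean()),
        "rmse": float(np.sqrt((diff ** 2).mean())),
        "max_abs_error": float(abs(diff).max()),
    }

da = xr.DataArray(np.linspace(0.0, 1.0, 100).reshape(10, 10),
                  dims=("lat", "lon"))
recon = da.round(2)  # stand-in for a lossy compress-decompress round trip
print(error_summary(da, recon))
```

Operating directly on labeled DataArrays means the same metrics extend naturally to per-dimension reductions (e.g., errors along `time` only) and to dask-backed lazy arrays.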
Desired Collaborators
We are looking for xarray users and experts, as well as compression, data format (notably netcdf and zarr), and statistical expertise.
Our thought is to implement statistical testing metrics that tell you whether the lossy compression filter you might apply to the data would introduce statistical artifacts. In that sense, the approach is compressor agnostic.
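One example of a compressor-agnostic check (a sketch, not necessarily one of the metrics in the research code) is a two-sample Kolmogorov-Smirnov test on the value distributions: it needs only the before and after arrays, regardless of which compressor produced them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
original = rng.normal(size=10_000)
reconstructed = np.round(original, 2)  # mild simulated lossy round trip

# Does lossy compression detectably change the distribution of values?
# Only the two arrays are needed, so the test is compressor agnostic.
ks = stats.ks_2samp(original, reconstructed)
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.pvalue:.4f}")
```

A large p-value here is consistent with the distributions being indistinguishable; an aggressive compressor would drive the statistic up and the p-value down.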
However, a lot of our work in lossy compression is more sophisticated than just Quantize. Our recent work is a collaboration with LLNL on Peter Lindstrom's zfp compressor, which is specifically designed for N-dimensional floating-point arrays.
There is something similar to Quantize for netCDF files, too. Can't remember how to use it, though.
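One candidate is the `least_significant_digit` argument to `createVariable` in netCDF4-python, which quantizes data before lossless compression. Below is a numpy-only sketch of the rounding its documentation describes (`around(scale*data)/scale` with `scale = 2**bits`); the ceil-of-log2 choice of `bits` here is my reading of that doc, so treat the exact formula as an assumption.

```python
import numpy as np

def quantize(data, least_significant_digit):
    # Round so that `least_significant_digit` decimal digits are preserved,
    # using a power-of-two scale as netCDF4-python's docs describe.
    bits = int(np.ceil(np.log2(10.0 ** least_significant_digit)))
    scale = 2.0 ** bits
    return np.around(scale * data) / scale

x = np.array([3.14159, 2.71828, 1.41421])
q = quantize(x, least_significant_digit=2)
print(q)
```

The quantized values compress much better losslessly because their trailing mantissa bits are zeroed, which is the same idea Quantize exploits.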