Assessment tools for lossy compression of geoscientific data
With data volumes increasing dramatically from both observations and numerical models, compression is increasingly necessary. In the past, compression was often not used at all; lossless compression methods have since become the norm, but they have their limits. Lossy compression methods offer much greater storage savings, but too much lossy compression can introduce undesirable features into the data. Recent work has shown that the amount and type of lossy compression that is optimal for geoscientific data depends on the kind of data and how it will be used. A wealth of statistical analyses can inform users how much lossy compression their data can tolerate, and in recent years useful measures have been investigated and implemented in a variety of packages.
We plan to take existing research code that performs statistical analyses comparing original data to its compressed-then-decompressed values, scope out which parts can be used immediately in Pangeo-style workflows, and package that code in a reusable way.
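As a rough sketch of the kind of round-trip comparison involved, the snippet below quantizes a synthetic field with numcodecs' Quantize codec (a stand-in for whatever lossy method is actually under study) and computes a few simple error statistics; the metric choices are illustrative, not those of the research code.

```python
import numpy as np
from numcodecs import Quantize

# Synthetic stand-in for a 2-D geoscientific field.
rng = np.random.default_rng(42)
original = rng.standard_normal((180, 360)).astype("float32")

# "Compress" by quantizing to roughly three significant decimal digits;
# Quantize is lossy on encode, and its decode step is exact.
codec = Quantize(digits=3, dtype="f4")
reconstructed = codec.decode(codec.encode(original))

# Simple error statistics comparing original and reconstructed values.
err = original - reconstructed
print(f"max abs error: {np.abs(err).max():.2e}")
print(f"RMSE:          {np.sqrt(np.mean(err ** 2)):.2e}")
print(f"Pearson r:     {np.corrcoef(original.ravel(), reconstructed.ravel())[0, 1]:.6f}")
```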
Anticipated Data Needs
We will need access to testing data stored on GLADE at NWSC.
Anticipated Software Tools
We will work to integrate the existing research code into the Pangeo ecosystem, primarily focusing on interoperability with xarray data structures.
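As a minimal sketch of what an xarray-native interface could look like (the function name compression_error_stats is hypothetical, and the three measures are placeholders for the richer statistics in the research code):

```python
import numpy as np
import xarray as xr

def compression_error_stats(original: xr.DataArray,
                            reconstructed: xr.DataArray) -> xr.Dataset:
    """Collect a few illustrative error measures comparing a field with its
    compressed-then-decompressed counterpart, keeping everything in xarray."""
    diff = original - reconstructed
    return xr.Dataset(
        {
            "max_abs_error": abs(diff).max(),
            "rmse": np.sqrt((diff ** 2).mean()),
            "pearson_r": xr.corr(original, reconstructed),
        }
    )

# Usage with a synthetic field; .round(2) stands in for a lossy round trip.
da = xr.DataArray(
    np.random.default_rng(0).standard_normal((90, 180)).astype("float32"),
    dims=("lat", "lon"),
    name="temperature",
)
print(compression_error_stats(da, da.round(2)))
```

Returning an xr.Dataset keeps the results labeled and serializable, so the same statistics could later be computed lazily over chunked (Dask-backed) arrays without changing the interface.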
We are looking for xarray users and experts, as well as people with expertise in compression, data formats (notably netCDF and Zarr), and statistics.