Assessment tools for lossy compression of geoscientific data
Scientific Motivation
With data volumes from both observations and numerical models increasing dramatically, compression is increasingly necessary. Historically, geoscientific data were often stored uncompressed; lossless compression has since become the norm. However, lossless methods have their limits. Lossy compression offers the potential for much greater storage savings, but too much of it can introduce undesirable artifacts into the data. Recent work has shown that the amount and type of lossy compression that is optimal for geoscientific data depends on the kind of data and how it will be used. A wealth of statistical analyses can inform users how much lossy compression their data can tolerate, and in recent years useful measures have been investigated and implemented in a variety of packages.
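To make the idea concrete, here is a minimal sketch of the kind of before/after comparison such analyses start from. This is illustrative only: the "compression" is simulated by decimal rounding, a stand-in for a real compress-decompress round trip, and the field is synthetic.

```python
import numpy as np

# Synthetic temperature-like field standing in for real geoscientific data.
rng = np.random.default_rng(0)
original = rng.normal(loc=280.0, scale=5.0, size=(64, 64))

# Simulate a lossy round trip by rounding to 2 decimal digits.
digits = 2
scale = 10.0 ** digits
reconstructed = np.round(original * scale) / scale

# Two simple error measures comparing original and reconstructed values.
rmse = np.sqrt(np.mean((original - reconstructed) ** 2))
max_abs_err = np.max(np.abs(original - reconstructed))
print(f"RMSE: {rmse:.6f}, max abs error: {max_abs_err:.6f}")
```

Real assessment tools go well beyond pointwise errors (e.g., spatial and spectral measures), but every such metric reduces to comparing the two arrays above.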
Proposed Hacking
We plan to take existing research code that performs statistical analyses comparing original data to its compressed-then-decompressed values, scope out what can be used immediately in Pangeo-style workflows, and package that code in a reusable way.
Anticipated Data Needs
We will need access to test data that resides on the GLADE file system at NWSC.
Anticipated Software Tools
We will work to integrate the existing research code into the Pangeo ecosystem, primarily focusing on interoperability with xarray data structures.
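As a rough sketch of what xarray interoperability could look like, the snippet below defines a hypothetical helper (the name `error_summary` and its exact statistics are assumptions, not the research code's actual API) that takes an original and a reconstructed DataArray and returns a few error statistics.

```python
import numpy as np
import xarray as xr

def error_summary(orig: xr.DataArray, recon: xr.DataArray) -> dict:
    """Illustrative only: pointwise error statistics between an original
    field and its compressed-then-decompressed counterpart."""
    diff = orig - recon
    return {
        "mean_error": float(diff.mean()),
        "rmse": float(np.sqrt((diff ** 2).mean())),
        "max_abs_error": float(abs(diff).max()),
    }

da = xr.DataArray(np.linspace(0.0, 1.0, 100).reshape(10, 10),
                  dims=("lat", "lon"))
recon = da.round(2)  # stand-in for a lossy compress-decompress round trip
print(error_summary(da, recon))
```

Operating directly on labeled DataArrays means the same metrics extend naturally to per-dimension reductions (e.g., errors along `time` only) and to dask-backed lazy arrays.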
Desired Collaborators
We are looking for xarray users and experts, as well as compression, data format (notably netcdf and zarr), and statistical expertise.
Our thought is to implement statistical testing metrics that tell you whether the lossy compression filter you might apply to the data would introduce statistical artifacts. In that sense, the approach is compressor agnostic.
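One example of a compressor-agnostic check (a sketch, not necessarily one of the metrics in the research code) is a two-sample Kolmogorov-Smirnov test on the value distributions: it needs only the before and after arrays, regardless of which compressor produced them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
original = rng.normal(size=10_000)
reconstructed = np.round(original, 2)  # mild simulated lossy round trip

# Does lossy compression detectably change the distribution of values?
# Only the two arrays are needed, so the test is compressor agnostic.
ks = stats.ks_2samp(original, reconstructed)
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.pvalue:.4f}")
```

A large p-value here is consistent with the distributions being indistinguishable; an aggressive compressor would drive the statistic up and the p-value down.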
However, a lot of our work in lossy compression is more sophisticated than just Quantize. Our recent work is a collaboration with LLNL on Peter Lindstrom's zfp compressor, which is specifically designed for N-dimensional floating-point arrays.
There is something similar to Quantize for netCDF files, too. Can't remember how to use it, though.
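One candidate is the `least_significant_digit` argument to `createVariable` in netCDF4-python, which quantizes data before lossless compression. Below is a numpy-only sketch of the rounding its documentation describes (`around(scale*data)/scale` with `scale = 2**bits`); the ceil-of-log2 choice of `bits` here is my reading of that doc, so treat the exact formula as an assumption.

```python
import numpy as np

def quantize(data, least_significant_digit):
    # Round so that `least_significant_digit` decimal digits are preserved,
    # using a power-of-two scale as netCDF4-python's docs describe.
    bits = int(np.ceil(np.log2(10.0 ** least_significant_digit)))
    scale = 2.0 ** bits
    return np.around(scale * data) / scale

x = np.array([3.14159, 2.71828, 1.41421])
q = quantize(x, least_significant_digit=2)
print(q)
```

The quantized values compress much better losslessly because their trailing mantissa bits are zeroed, which is the same idea Quantize exploits.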