Xarray and compression options for large NetCDF files

Hi all! I’ve got an xarray question and Scott Henderson directed me here.

I’m trying to automate testing some code I wrote on real data, specifically BedMachine and the MEaSUREs InSAR velocity for Antarctica, both on NSIDC. Both of them are stored as NetCDF. (I’ve been using only synthetic data so far.) These are both big and require an earthdata login, so it isn’t feasible to download them as part of the CI run or to pack them into the docker image I use as a testing environment.

What I’d like to do is mask over everything except the sites I’m testing on, which for now is Pine Island Glacier and Larsen C Ice Shelf. I’d expect that, since there are long runs of the no-data value or NaN, the file could be compressed a lot. For the BedMachine dataset, this works – it goes from 791MB to 28MB.
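For concreteness, a minimal sketch of what I mean by masking — the bounding boxes below are placeholders, not the real extents of Pine Island or Larsen C, and I'm assuming 1-D `x`/`y` coordinates in projected metres:

```python
import numpy as np
import xarray as xr

def mask_outside_sites(ds, boxes):
    """Set data variables to NaN outside the given (x0, x1, y0, y1) boxes."""
    keep = False
    for x0, x1, y0, y1 in boxes:
        in_box = (ds.x >= x0) & (ds.x <= x1) & (ds.y >= y0) & (ds.y <= y1)
        keep = keep | in_box
    return ds.where(keep)

# Toy example: keep only the lower-left corner of a 4x4 grid
ds = xr.Dataset(
    {"v": (("y", "x"), np.ones((4, 4)))},
    coords={"x": np.arange(4.0), "y": np.arange(4.0)},
)
masked = mask_outside_sites(ds, [(0.0, 1.0, 0.0, 1.0)])
```

Everything outside the boxes becomes NaN, which is what should make the long constant runs compress well.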

That’s all well and good, but it doesn’t work nearly as well for the velocity map. The original NetCDF file isn’t compressed at all, so it starts out at 6.4GB. When I just read it in and write it back out with level-1 zlib compression, using the same chunk size as BedMachine, it goes down to 3.5GB. Masking out the regions I don’t care about brings it down to 2.4GB. That’s better, but not nearly as much of an improvement as for the other dataset. The velocity map has fewer data points than the thickness map, and it has 6 fields whereas the thickness has 4. So naively I’d expect it to be possible to compress the velocity to roughly 1.5x the size of the thickness, or about 1.2GB, even without masking.

I wrote some code for this which is hosted here. I don’t feel like I have the best understanding of what the encoding does, so it’s likely that I mis-specified something there.

Hi and welcome to Pangeo! :wave:

Quick clarification question: does the file you write HAVE to be NetCDF, or are you open to using Zarr as your intermediate format? The reason I ask is that Zarr has more flexible compression options.

Hi Ryan, thanks for getting back to me! I tried out your suggestion and wrote the masked data out to Zarr; the velocity map is down to 1.5GB, which is much better than before but still larger than what I was able to get on the other file. I used the Blosc compressor, which I copied from the example in the xarray documentation. I don’t know the difference between the various compression algorithms, so if you have suggestions for other settings I’m happy to try them.

If I were sure that I’d done everything else correctly, then fixing this might just be a matter of switching to Zarr to get access to better compression algorithms. But since I’m getting very little space savings by masking, I get the feeling that I’ve done something wrong in how I’ve tried to mask the data and thus won’t see any appreciable savings regardless of which storage format I use.

That would be a useful comparison to have: different compression algorithms and levels run against a representative dataset.

Well I found out what was wrong! The dataset that’s so difficult to compress was storing the spatial coordinates in polar stereo, but it was also storing the latitude and longitude of all of those points as 64-bit floats. Dropping the lat/lon fields and applying the same compression options as the easy dataset makes it go from 6.4GB to 1.2GB (exactly as I expected!) and then masking out the regions I don’t care about crunches it down further to 37MB.
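The fix, in miniature — drop the float64 lat/lon auxiliary fields before compressing. `lat`/`lon` are assumed names for those variables, and a tiny synthetic dataset stands in for the real one:

```python
import numpy as np
import xarray as xr

ny, nx = 64, 64
ds = xr.Dataset(
    {
        "vx": (("y", "x"), np.zeros((ny, nx), dtype=np.float32)),
        "lat": (("y", "x"), np.zeros((ny, nx))),  # float64: twice the footprint of float32
        "lon": (("y", "x"), np.zeros((ny, nx))),
    }
)
# Drop the redundant geographic coordinates before writing compressed output
slim = ds.drop_vars(["lat", "lon"])
```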

Sorry to bother everyone over something so silly. To speak to Richard Scott’s point, I also think it would be great to have a “compression for the clueless” guide somewhere.

@danshapero I get a 404 on your example link above? https://gitlab.com/danshapero/xarray-compression

Thanks!

@RichardScottOZ woops, forgot to make it public! Should be visible now.
