Xarray and compression options for large NetCDF files

Hi all! I’ve got an xarray question and Scott Henderson directed me here.

I’m trying to automate testing some code I wrote on real data, specifically BedMachine and the MEaSUREs InSAR velocity for Antarctica, both on NSIDC. Both of them are stored as NetCDF. (I’ve been using only synthetic data so far.) These are both big and require an earthdata login, so it isn’t feasible to download them as part of the CI run or to pack them into the docker image I use as a testing environment.

What I’d like to do is mask over everything except the sites I’m testing on, which for now is Pine Island Glacier and Larsen C Ice Shelf. I’d expect that, since there are long runs of the no-data value or NaN, the file could be compressed a lot. For the BedMachine dataset, this works – it goes from 791MB to 28MB.
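For concreteness, a minimal sketch of what I mean by masking — the bounding boxes below are placeholders, not the real extents of Pine Island or Larsen C, and I'm assuming 1-D `x`/`y` coordinates in projected metres:

```python
import numpy as np
import xarray as xr

def mask_outside_sites(ds, boxes):
    """Set data variables to NaN outside the given (x0, x1, y0, y1) boxes."""
    keep = False
    for x0, x1, y0, y1 in boxes:
        in_box = (ds.x >= x0) & (ds.x <= x1) & (ds.y >= y0) & (ds.y <= y1)
        keep = keep | in_box
    return ds.where(keep)

# Toy example: keep only the lower-left corner of a 4x4 grid
ds = xr.Dataset(
    {"v": (("y", "x"), np.ones((4, 4)))},
    coords={"x": np.arange(4.0), "y": np.arange(4.0)},
)
masked = mask_outside_sites(ds, [(0.0, 1.0, 0.0, 1.0)])
```

Everything outside the boxes becomes NaN, which is what should make the long constant runs compress well.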

That’s all well and good, but it doesn’t work nearly as well for the velocity map. The original NetCDF file isn’t compressed at all, so it starts out at 6.4GB. When I just read it in and write it back out with level-1 zlib compression, using the same chunk size as BedMachine, it goes down to 3.5GB. Masking out the regions I don’t care about brings it down to 2.4GB. That’s better, but not nearly as much of an improvement as for the other dataset. The velocity map has fewer data points than the thickness map, and it has 6 fields whereas the thickness has 4. So naively I’d expect it to be possible to compress the velocity to roughly 1.5x the size of the thickness, or about 1.2GB, even without masking.

I wrote some code for this which is hosted here. I don’t feel like I have the best understanding of what the encoding does, so it’s likely that I mis-specified something there.

Hi and welcome to Pangeo! :wave:

Quick clarification question: does the file you write HAVE to be NetCDF, or are you open to using Zarr as your intermediate format? The reason I ask is that Zarr has more flexible compression options.

Hi Ryan, thanks for getting back to me! I tried out your suggestion and wrote the masked data out to Zarr; the velocity map is down to 1.5GB, which is much better than before but still larger than what I was able to get on the other file. I used the Blosc compressor, which I copied from the example in the xarray documentation. I don’t know the difference between the various compression algorithms, so if you have suggestions for other settings I’m happy to try them.

If I were sure that I’d done everything else correctly, then fixing this might just be a matter of switching to Zarr to get access to better compression algorithms. But since I’m getting very little space savings by masking, I get the feeling that I’ve done something wrong in how I’ve tried to mask the data and thus won’t see any appreciable savings regardless of which storage format I use.

That would be a useful comparison to have: different compression algorithms and levels run against a representative dataset.

Well I found out what was wrong! The dataset that’s so difficult to compress was storing the spatial coordinates in polar stereo, but it was also storing the latitude and longitude of all of those points as 64-bit floats. Dropping the lat/lon fields and applying the same compression options as the easy dataset makes it go from 6.4GB to 1.2GB (exactly as I expected!) and then masking out the regions I don’t care about crunches it down further to 37MB.
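The fix, in miniature — drop the float64 lat/lon auxiliary fields before compressing. `lat`/`lon` are assumed names for those variables, and a tiny synthetic dataset stands in for the real one:

```python
import numpy as np
import xarray as xr

ny, nx = 64, 64
ds = xr.Dataset(
    {
        "vx": (("y", "x"), np.zeros((ny, nx), dtype=np.float32)),
        "lat": (("y", "x"), np.zeros((ny, nx))),  # float64: twice the footprint of float32
        "lon": (("y", "x"), np.zeros((ny, nx))),
    }
)
# Drop the redundant geographic coordinates before writing compressed output
slim = ds.drop_vars(["lat", "lon"])
```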

Sorry to bother everyone over something so silly. To speak to Richard Scott’s point, I also think it would be great to have a “compression for the clueless” guide somewhere.

@danshapero I get a 404 on your example link above? https://gitlab.com/danshapero/xarray-compression

Thanks!

@RichardScottOZ woops, forgot to make it public! Should be visible now.
