Hi all! I’ve got an xarray question and Scott Henderson directed me here.
I’m trying to automate testing some code I wrote on real data, specifically the BedMachine and MEaSUREs InSAR velocity datasets for Antarctica, both hosted on NSIDC and stored as NetCDF. (I’ve been using only synthetic data so far.) Both files are big and require an Earthdata login, so it isn’t feasible to download them as part of the CI run or to pack them into the Docker image I use as a testing environment.
What I’d like to do is mask out everything except the sites I’m testing on, which for now are Pine Island Glacier and Larsen C Ice Shelf. I’d expect that, since this leaves long runs of the no-data value or NaN, the file could be compressed a lot. For the BedMachine dataset, this works: it goes from 791MB to 28MB.
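In case it helps to make this concrete, the masking step looks roughly like the sketch below. The filename and bounding boxes are placeholders (the real extents live in my repo); the idea is just to NaN out everything outside the test-site boxes so zlib sees long constant runs.

```python
import xarray as xr

# Placeholder bounding boxes in polar stereographic (EPSG:3031) metres,
# one per test site; the real values are in the linked repo.
boxes = [
    (-1.75e6, -1.45e6, -4.0e5, -1.0e5),  # (xmin, xmax, ymin, ymax), site 1
    (-2.40e6, -2.05e6,  9.0e5,  1.3e6),  # site 2
]

ds = xr.open_dataset("BedMachineAntarctica.nc")  # placeholder filename

# Build a 2-D boolean mask that is True inside any of the boxes; the
# 1-D x and y coordinates broadcast against each other automatically.
mask = False
for xmin, xmax, ymin, ymax in boxes:
    in_box = (ds.x >= xmin) & (ds.x <= xmax) & (ds.y >= ymin) & (ds.y <= ymax)
    mask = mask | in_box

# Everything outside the boxes becomes NaN. (Note: this promotes any
# integer variables to float, which can itself affect file size.)
masked = ds.where(mask)
```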
That’s all well and good, but it doesn’t work so well for the velocity map. The original NetCDF file isn’t compressed at all, so it starts out at 6.4GB. When I just read it in and write it back out with level-1 zlib compression, using the same chunk size as BedMachine, it goes down to 3.5GB. Masking out the regions I don’t care about brings it down to 2.4GB. That’s better, but not nearly as much of an improvement as for the other dataset. The velocity map has fewer data points than the thickness map, and it has 6 fields where the thickness has 4. So naively I’d expect it to be possible to compress the velocity to roughly 1.5x the size of the thickness file, or about 1.2GB, even without masking.
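The write side looks roughly like this; the chunk size and output filename are placeholders, but the encoding keys (`zlib`, `complevel`, `chunksizes`) are the standard netCDF4 options that xarray passes through to the backend:

```python
# Level-1 zlib compression on every 2-D variable, with the same chunk
# size I used for BedMachine (placeholder value here).
chunks = (1024, 1024)
encoding = {
    name: {"zlib": True, "complevel": 1, "chunksizes": chunks}
    for name, var in masked.data_vars.items()
    if var.ndim == 2  # skip scalar/1-D variables, which can't take these chunks
}
masked.to_netcdf("velocity_masked.nc", encoding=encoding)
```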
I wrote some code for this, which is hosted here. I don’t feel like I have the best understanding of what the `encoding` argument does, so it’s likely that I mis-specified something there.
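In case it’s useful for diagnosing, this is how I’ve been inspecting what encoding the variables come with; `open_dataset` populates `.encoding` on each variable from the file, and keys you don’t explicitly override still apply when writing back out:

```python
ds = xr.open_dataset("velocity.nc")  # placeholder filename
for name, var in ds.data_vars.items():
    print(name, var.dtype, var.encoding)

# Keys worth checking: 'dtype', '_FillValue', 'zlib', 'complevel',
# 'chunksizes', and 'scale_factor' / 'add_offset' if the file packs
# floats into integers. For example, a source stored as float32 that
# gets written out as float64 doubles in size before compression.
```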