Icepyx - Python tools for ICESat-2 data

I love seeing these sorts of in-depth technical discussions happening on our Pangeo forum!

I’ll add my unsolicited $0.02. :laughing: I am a novice with ICESat-2 data, but I have done a lot of thinking about how to build sustainable, useful domain-specific scientific computing tools in Python. I wanted to respond to this suggestion:

This sounds simple, and one hears it often from people who see that xarray or pandas almost, but not exactly, fits their needs. But when you look at xarray and pandas and see how much careful thought, and how many years of iteration and effort, have gone into these features, you realize that maybe it’s not quite so simple. Another consideration is the community of potential developers: if you build a new tool that re-implements this functionality, do you have the manpower to maintain it? To make sure it works with Python 3.7, 3.8, 3.9, …, on Windows, macOS, and Linux, forever?

On the other hand, many of the things you want (sparse arrays, custom indexes, support for groups) are on xarray’s development roadmap. Many of these things are underway! Instead of making a new package, you could spend your effort helping make xarray better and more flexible. That is the path we try to gently nudge folks towards in Pangeo, because it is empirically a more sustainable strategy.

Regarding the HDF5 format itself… Xarray can’t read arbitrary HDF5 files, only those that conform to the netCDF data model. Fortunately, your ICESat-2 files appear to do so: @weiji14 showed how you can open them. Another limitation is that xarray can only open one group at a time. One way to extend xarray with custom functionality is by writing accessors; you might want to look into this.
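
To make the accessor idea a bit more concrete, here is a minimal sketch of what one could look like. Treat it as an illustration, not a proposal for icepyx's actual API: the accessor name `is2`, the method `height_range`, and the default variable name `h_li` are all assumptions I made up for the example, as are the file name and group path in the comment. It uses a small synthetic dataset so it runs without any ICESat-2 file.

```python
import numpy as np
import xarray as xr


# "is2" is a made-up accessor name; pick whatever suits your package.
@xr.register_dataset_accessor("is2")
class ICESat2Accessor:
    """Hypothetical helper methods, available on any Dataset as ds.is2."""

    def __init__(self, xarray_obj):
        # xarray passes in the Dataset the accessor is attached to.
        self._obj = xarray_obj

    def height_range(self, var="h_li"):
        """Return (min, max) of a height variable; NaNs are skipped."""
        h = self._obj[var]
        return float(h.min()), float(h.max())


# Synthetic stand-in for one group of a granule. With a real file you would
# open a single group at a time, for example (file name and group path are
# assumptions):
#   ds = xr.open_dataset("granule.h5", group="gt1l/land_ice_segments")
ds = xr.Dataset({"h_li": ("segment", np.array([10.0, np.nan, 12.5]))})

print(ds.is2.height_range())
```

The appeal of this pattern is that the domain-specific logic lives in your package, while indexing, I/O, and everything else stays with xarray and is maintained upstream.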

If you have the option to reformat the data into a different format, I would look at both Zarr and TileDB. TileDB in particular has some very cool options for sparse data.

I hope you don’t mind me sharing these opinions. Take them all with a grain of salt and do whatever is best for your community.