Icepyx - Python tools for ICESat-2 data

JessicaS11 · July 30, 2020, 7:56pm

A few thoughts as I read through these really interesting and insightful posts (there are a lot of great suggestions!). Please forgive me for the length.

Where do we cross the line from data storage to data analysis? HDF5 was chosen for IS2 data for a reason, and it is probably the best format available right now for the complex, nested nature of ICESat-2 data (excluding a few cloud-optimized formats, but that’s another debate). But I can’t help but notice that nearly every analysis pipeline I’ve seen does not USE the data in its stored format, so to speak. Instead, the relatively few desired variables are read in and often saved in a “simpler” format (whether that’s HDF5 files sorted into ascending/descending tracks or a CSV), and computations are performed on fairly simple array representations of the data (e.g. lat, lon, time, height). I’m wondering if it could be useful to separate these pieces of the conversation into data storage, data read/write, and data analysis. The last of these depends on the read-in format for its implementation, but could quite likely be generalized to handle any array-like input through existing functionality.
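As a rough illustration of that "read a few variables, save something simpler" pattern (a sketch, not taken from any existing pipeline): the nested dictionary below is a toy stand-in for an opened granule, and the group and variable names (gt1l, land_ice_segments, h_li, and so on) are assumed to follow ATL06-style naming. The helper names get_path, flatten_beam, and write_csv are hypothetical.

```python
import csv
import io


def get_path(root, path):
    """Follow a slash-separated path through nested mappings, one key at a time."""
    node = root
    for key in path.strip("/").split("/"):
        node = node[key]
    return node


def flatten_beam(root, beam, group, variables):
    """Pull the requested variables for one beam into plain Python lists."""
    return {
        name: list(get_path(root, f"{beam}/{group}/{name}"))
        for name in variables
    }


def write_csv(columns, handle):
    """Write equal-length columns to an open text handle as CSV with a header row."""
    names = list(columns)
    writer = csv.writer(handle)
    writer.writerow(names)
    writer.writerows(zip(*(columns[name] for name in names)))


# Toy stand-in for an opened granule (values are made up for illustration).
granule = {
    "gt1l": {
        "land_ice_segments": {
            "latitude": [-70.10, -70.11, -70.12],
            "longitude": [45.00, 45.01, 45.02],
            "delta_time": [1.0, 2.0, 3.0],
            "h_li": [1203.4, 1203.9, 1204.1],
        }
    }
}

columns = flatten_beam(
    granule,
    beam="gt1l",
    group="land_ice_segments",
    variables=["latitude", "longitude", "delta_time", "h_li"],
)

buffer = io.StringIO()
write_csv(columns, buffer)
print(buffer.getvalue())
```

Because get_path only relies on indexing by key, the same function should in principle accept an open h5py.File, whose groups support that kind of indexing, though this sketch has not been run against a real granule.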
What do we hope to enable people to do with the data that they can’t easily do now, and how does the data read-in work within the research pipeline? One of the strengths of xarray is that it makes it easier to work with combined raster and vector data. If my understanding is correct, h5py would not be a good tool for bringing in or storing large gridded datasets, nor does it provide functionality to easily relate them to one another.
I’d remind you to review our survey results for question 2. Respondents (29 for this question) were allowed to check as many boxes as they wanted. A review of the individual (non-tabulated) responses showed that relatively few people preferred only one format, and those single-format preferences were fairly evenly split between hdf5, xarray, and netcdf.
Let us not forget that icepyx has a variables module that was built to make it easier to discover, manage, and access ICESat-2’s complicated nested variable structure. We shouldn’t be afraid to use that (and add to it) to handle the variable manipulation/path side of things so that variable lists can be associated with specific files (if need be) and used to pass data into and out of the various formats.
How about some version of a hybrid approach that leverages each tool for what it excels at? Certain operations, such as local variable subsetting and extracting data in a meaningful way (e.g. to capture ascending/descending info or strong/weak beam orientations), are likely going to be easiest if we are interacting directly with the HDF5 file. Later operations, such as comparing to raster datasets, are likely easiest using tools like xarray. Still other operations, such as computing running means along-track, are effectively done independently of either tool, since they are simply mathematical operations on an array. On top of all of this, we have to remember that one of icepyx’s goals is to make using IS2 data easy for scientists who aren’t necessarily skilled programmers. This portion of the population is probably most familiar and comfortable with basic arrays and associated tabular formats (e.g. pandas).
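To illustrate the point about tool-independent operations, here is a minimal sketch of an along-track running mean that accepts any sequence of numbers, whether it came from an HDF5 read, an xarray object converted to values, or a plain list. The function name running_mean and the sample heights are hypothetical; in practice something like numpy.convolve with mode="valid" would be the more usual choice.

```python
def running_mean(values, window):
    """Return the mean of each full window of `window` consecutive values.

    The output has len(values) - window + 1 entries (no padding at the
    ends), or is empty if there are fewer values than the window size.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    data = [float(v) for v in values]
    if len(data) < window:
        return []

    # Keep a running sum so each step adds one value and drops one.
    total = sum(data[:window])
    means = [total / window]
    for i in range(window, len(data)):
        total += data[i] - data[i - window]
        means.append(total / window)
    return means


# Made-up along-track heights, in order of acquisition.
heights = [100.0, 102.0, 104.0, 103.0, 101.0]
smoothed = running_mean(heights, window=3)
print(smoothed)
```

Nothing here knows or cares which file format the heights came from, which is the sense in which the analysis step can be separated from the read-in step.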
I think that the potential for adding methods like sel, values, etc. onto existing library functionality (whether that is through subclassing within icepyx or as a direct contribution to the parent library itself) is a great idea. I can see ways that these functions would be developed as part of implementing data access features within icepyx.
I want to acknowledge that this entire discussion leaves out cloud-optimized formats, including those specifically designed to tackle some of these challenges. We ultimately plan to include interoperability with those formats (especially if they are made available by DAACs), so our approach should be flexible enough that it will also work with those formats.

@rabernat Do you know if it would be possible to add the ICESat-2 subcategory to this thread? I don’t see a way to do so on my end.
