Join the conversation about icepyx! What are your ICESat-2 use cases? What examples, features, and/or tutorials would you like find most valuable? How can you contribute? We want to hear from you!
Thanks for your post @JessicaS11! I had to google icepyx to find out what it is. Here is a link to the github repo:
(great input from everybody so far)
The challenge for us is having to deal with massive “sparse scattered data” (not necessarily traditional point cloud as the data have some structure: along tracks, beam pairs, repeat cycles, etc.). Most projects related to the xarray/NCAR community seem to be tailored towards structured/gridded/raster products (e.g. model outputs, satellite imagery). Perhaps one way forward, instead of trying to accommodate all potential use cases, is to “decide” on a preferred data structure/file format (one that is convenient) and design
icepyx functionality around that. And to avoid imposing this to users as the only way to work with ICESat-2 data, also expose the underlaying variables in a similar way as
xarray.values do, which can then be used/transformed in any way the user wants.
I agree. A nested data format similar to hdf5 is inevitable if we want to use it for all IS2 products. I think a major motivation for people to use pandas and xarray are their easy-to-use functions for data handling. Instead of converting hdf5 data to other formats to take advantage of the xarray/pandas functionality, can we build on the hdf5 format, and provide some popular functionality that are most frequently used in pandas and xarray? For example, how about we inherit the hdf5 class and add methods such as
concat , etc., to provide easier data subset and aggregation. Just as Fernando mentioned, users should still be able to use the underlying data directly by calling the
values method.Xarray is designed to handle gridded data and pandas is designed to handle tabular data. For nested data such as hdf5, there is still not an advanced package to provide similar high-level methods yet (are there?). Do you guys think this is something we should pursue or is it too far off-focus from icepyx?
Xarray isn’t just for gridded datasets actually, and it does work with HDF5 files through the h5netcdf engine (though it’s not well advertised)! I’ve actually mentioned this before in https://github.com/icesat2py/icepyx/issues/35, but just to put it here, this is how the ICESat-2 dataset (ATL06 Land Ice Height) looks like in xarray:
ds = xr.open_dataset("ATL06_20200513024639_07260711_003_01.h5", engine="h5netcdf", group="gt2r/land_ice_segments") print(ds) <xarray.Dataset> Dimensions: (delta_time: 106948) Coordinates: * delta_time (delta_time) datetime64[ns] 2020-05-13T02:46:48.00... latitude (delta_time) float64 ... longitude (delta_time) float64 ... Data variables: atl06_quality_summary (delta_time) int8 ... h_li (delta_time) float32 ... h_li_sigma (delta_time) float32 ... segment_id (delta_time) float64 ... sigma_geo_h (delta_time) float32 ... Attributes: Description: The land_ice_height group contains the primary set of deriv... data_rate: Data within this group are sparse. Data values are provide...
Each ‘data variable’ is accessible as a 1d array (e.g. via
ds.h_li_sigma, etc), and it also reads/exposes metadata from the HDF5 file directly. Note that xarray does support sparse arrays since https://github.com/pydata/xarray/pull/3117 (though again, not particularly well documented), and there’s a really vibrant ecosystem of projects around it, and options to extend it further. The Pangeo people are really nice too
Rather than reinvent the wheel, perhaps the easiest way we can get
xarray to work with ICESat-2’s nested data structure better is to have it open multiple groups directly. I’m thinking of a syntax like this:
xr.open_dataset("ATL06_*.h5", engine="h5netcdf", group=["gt2r/land_ice_segments", "gt2l/land_ice_segments"])
I can open up an issue on
xarray for this, and I’ve got some experience with the
xarray repo so could even get a pull request in (if it’s not too complicated). Just want to know if this is feasible or a good way forward.
A few thoughts as I read through these really interesting and insightful posts (there are a lot of great suggestions!). Please forgive me for the length.
Where do we cross the line from data storage to data analysis? Hdf5 was chosen for IS2 data for a reason, and is probably the best format available right now for the complex, nested nature of ICESat-2 data (excluding a few cloud-optimized formats, but that’s another debate). But I can’t help but notice that nearly every analysis pipeline I’ve seen does not USE the data in it’s stored format, so to speak. Instead, the relatively few desired variables are read in and often saved in a “simpler” format (whether that’s sorted ascending/descending hdf5s or a csv), and computations are performed on fairly simple array representations of the data (e.g. lat, lon, time, height, etc.). I’m wondering if it could be useful to separate some of these pieces of the conversation to clarify between data storage, data read/write, and data analysis (the last of which will be dependent on read-in format for implementation but could also fairly likely be generalized to handle any array-like inputs through existing functionality).
What do we hope to enable people to do with the data that they can’t easily do now, and how does the data read-in work within the research pipeline? One of the strengths of xarray is that it makes it easier to work with combined raster and vector data. If my understanding is correct, h5py would not be a good format for bringing in or storing large gridded datasets, nor does it provide functionality to easily relate them to one another.
I’d remind you to review our survey results for question 2. Responders (29 for this question) were allowed to check as many boxes as they wanted. A review of individual responses (non-tabulated) showed relatively few people only preferring one format, and that one format was well split between hdf5, xarray, and netcdf.
Let us not forget that icepyx has a
variablesmodule that was built to make it easier to discover, manage, and access ICESat-2’s complicated nested variable structure. We shouldn’t be afraid to use that (and add to it) to handle the variable manipulation/path side of things so that variable lists can be associated with specific files (if need be) and used to pass data into and out of the various formats.
How about some version of a hybrid approach that leverages each tool for what it excels at? Certain operations, such as local variable subsetting and extracting data in a meaningful way (e.g. to capture ascending/descending info or strong/weak beam orientations, etc) are likely going to be easiest if we are interacting directly with the hdf5 file. Later operations, such as comparing to raster datasets, are likely easiest using tools like xarray. Still other operations, such as computing running means along-track, are effectively done independently of either tool, since they are simply mathematical operations on an array. On top of all of this, we have to remember that one of icepyx’s goals is to make using IS2 data easy for scientists who aren’t necessarily skilled programmers. This portion of the population is probably most familiar and comfortable with basic arrays and associated tabular formats (e.g. pandas).
I think that the potential for adding methods like
values, etc. onto existing library functionality (whether that is through subclassing within icepyx or as a direct contribution to the parent library itself) is a great idea. I can see ways that these functions would be developed as part of implementing data access features within icepyx.
I want to acknowledge that this entire discussion leaves out cloud-optimized formats, including those specifically designed to tackle some of these challenges. We ultimately plan to include interoperability with those formats (especially if they are made available by DAACs), so our approach should be flexible enough that it will also work with those formats.
@rabernat Do you know if it would be possible to add the ICESat-2 subcategory to this thread? I don’t see a way to on my end.
I love seeing these sorts of in-depth technical discussions happening on our Pangeo forum!
I’ll add my unsolicited $0.02. I am a novice with ICESat-2 data, but I have done a lot of thinking about how to build sustainable, useful domain-specific scientific computing tools in python. I wanted to respond to this suggestion:
This sounds simple, and one hears it often from people who see that xarray or pandas almost, but not exactly, fits their needs. But when you go look at xarray and pandas, and you see how much careful thought, and years of iteration and effort, has gone into these features, you realize that maybe it’s not quite so simple. Another consideration is the community of potential developers: if you build a new tool that re-implements this functionality, do you have the manpower to maintain it? Make sure it works with python 3.7, 3.8, 3.9, …, on windows, mac and linux, forever?
On the other hand, many of the things you want (sparse arrays, custom indexes, support for groups) are on xarray’s development roadmap. Many of these things are underway! Instead of making a new package, you could spend your effort helping make xarray better and more flexible. That is the path we try to gently nudge folks towards in Pangeo, because it is empirically a more sustainable strategy.
Regarding the HDF5 format itself… Xarray can’t read arbitrary HDF5 files–only those that conform to the netCDF data model. Fortunately, your ICESat-2 files appear to do so: @weiji14 showed how you can open them. Another limitation is that it can only open one group at a time. One way to extend xarray with custom functionality is by writing accessors–you might want to look into this.
If you have the option to reformat the data into a different format, I would look at both Zarr and TileDB. TileDB in particular has some very cool options for sparse data.
I hope you don’t mind me sharing these opinions. Take them all with a grain of salt and do whatever is best for your community.
Thanks @rabernat for the advice. I realized the amount of work after another look at the xarray’s github page. The example @weiji14 showed is very helpful. One thing that I noticed is it only reads back all the variables immediately below the specified group path. The variables at a lower level will not be retrieved. The version 2 of the ICESat-2 ATL10 sea ice freeboard data is an example with such complicated data structure. For example, there are variables under
'gt2r/freeboard_beam_segment/height_segments/' as well as
'gt2r/freeboard_beam_segment/'. I am not sure if the version 3 data will be simpler. ATL07 and ATL09 are slightly simpler, but they still have variables under different subgroups of the same beam (the top level). Can
xarray handle such group structure in some way? Otherwise, it seems the
accessors might be a promising way for
icepyx to take advantage of the existing
xarray functionality. @weiji14, do you think it would be easier to write customized intake wrapper for different dataset? The issue with this approach is that the resulting data will be a couple of
xarray datasets and we still have to organize them in some way, such as packaging them into the IS2 data class.
ArViz.InferenceData is a good model if you really, really want to preserve the hierarchy (https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html).
A nice way forward would be to break out the ArViz code into a separate package that adds group handling on top of xarray (there’s a lot of demand for such a package). icepyx could then use that new package as a dependency.
‘Sustainability’ is exactly the right term to use here. We could make a custom/fit-for-purpose/efficient tool that is designed solely for ICESat-2 data, or we can put our efforts into a tool that handles satellite altimetry data for years to come, be it ICESat-3, CRISTAL or so on.
This I speak from experience, as most of my custom HDF5 scripts written for ICESat-1 data (5 years ago) just don’t work anymore. Part of it is a different data structure (1 laser instead of 6), but there are core functions of array data handling that are common to any discipline, and that should go into
xarray! This is one of Pangeo’s goals - “Foster collaboration around the open source scientific python ecosystem”, i.e. don’t work in silos, come up with a way to re-use existing tools in the ecosystem. Ideally we would have 80% of our tasks handled by PyData/Pangeo tools, and put the remaining 20% ICESat-2 domain-specific stuff into
ds = xr.open_dataset("...") ds.ipx.gt1l # Ground track 1 left laser ds.ipx.gt2r # Ground track 2 right laser # etc
Thanks for chipping in @dcherian! So I suppose the hierarchy would be something like datagroup (?) -> dataset -> dataarray. But this should really, really be a last resort (as you emphasized) if there’s a clear reason for the nested structure to exist in the first place.
Just cross-referencing the “Dataset groups” issue at https://github.com/pydata/xarray/issues/1092 and the more recent “Feature Request: Hierarchical storage and processing in xarray” issue at https://github.com/pydata/xarray/issues/4118. See also the Arviz
InferenceData documentation at https://arviz-devs.github.io/arviz/generated/arviz.InferenceData.html#arviz.InferenceData and code at https://github.com/arviz-devs/arviz/blob/v0.9.0/arviz/data/inference_data.py#L45. I’ll try to look into this more (if I have time) as it looks really promising:
Surprisingly, this somewhat works, though there’s lots of warnings, and the very nested ‘gt**’ groups aren’t being read properly:
import arviz as az datagroup = az.from_netcdf(filename="ATL06.003/2020.05.13/ATL06_20200513024639_07260711_003_01.h5") datagroup
Thanks for all of the great ideas and resources, @dcherian @liuzheng-arctic @weiji14 @rabernat @fspaolo. A great next step (thanks to @aaarendt for the idea) would be to put together a set of very simple read-in workflow examples using a single-file dataset that show the process of getting from a local data file to a loaded dataset (we can define a few nested parameters that need to be included). We can outline the process (even if the underlying readers, etc. specifically for IS2 data don’t fully exist yet) for several different iterations of the proposed tools (e.g. hdf5, captoolkit, xarray, intake, arviz) and use what we learn (what’s easy, what challenges/limitations are there, what would need to be built in and out of icepyx to fully accomplish this) to select our path forward. I’d encourage anyone who has a few minutes to start thinking about/working on this in the next couple weeks to do so, and we can plan to divide into teams during our next call to work on these!