NASA funding and the Pangeo ecosystem

tl;dr: Pangeo tools are great already, but there’s more to be done! Let’s coordinate efforts to fund more support and improvements!

Outcomes from Pangeo ML NASA award

Several years ago Joe Hamman, Ryan Abernathey, Dave Hoese, Jim Bednar, and Tom Augspurger submitted a proposal to the 2020 NASA ACCESS RFP to build open source tools and pipelines for scalable machine learning using NASA Earth observation data. The project goals have remained the same since the beginning, although the team has shifted: Martin Durant stepped in for Tom and I stepped in for Joe. New team members have also joined, including Raphael Hagen and Andrew Huang.

I’ve uploaded the final report covering the main three-year project period to figshare, in case people are interested in reading more about the work from this project. Here’s a small subset of the many accomplishments:

  • Major progress towards Pyresample 2.0 to allow SatPy and Pyresample to perform better and be more compatible with Dask and Xarray
  • Numerous features and updates in the HoloViz suite of tools, including improved support for interactive exploration of Earth science and ML datasets through GeoViews, hvPlot, Datashader, SpatialPandas, and HoloViews
  • Improved dataset access via development in fsspec, Intake, and Kerchunk
  • Performant data loading with Xbatcher
  • General support of the Pangeo software ecosystem via contributions to libraries like Xarray and Zarr
  • New demonstrations of best practices for scalable machine learning, leading to research outputs and educational Pythia cookbooks
  • Community development via blog posts, podcasts, conference presentations, tutorials, sprints, working groups, and more

2024 NASA funding opportunities

Funding for Pangeo tools can have a big impact for our community and beyond. There are several NASA funding calls for 2024 and it would be great if they were used to support more development within the Pangeo ecosystem. If anyone else is interested in these opportunities and would like to join a call to discuss ideas and coordination, I’d welcome you to share here and we can set up a time for a group call.

2024:
Topical Workshops, Symposia, and Conferences (TWSC)
F.7 Support for Open-Source Tools, Frameworks, and Libraries
F.14 High Priority Open-Source Science
F.9 Citizen Science Seed Funding Program
F.8 Supplements for Open-Source Science

Two components that I am particularly compelled to propose are:

  1. Validation tools for cloud optimized datasets (xref Tool for validating geo data/services moved to the cloud? - #7 by rabernat, but including both new datasets and migrated datasets). In a discussion with @rabernat a couple years ago, he pointed me to great_expectations as an example of this being done well in other fields.
  2. General support for the Pangeo ecosystem of tools including Xarray, Zarr
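To make the first idea concrete, here is a minimal, standard-library-only sketch of what expectation-style validation could look like. It is loosely inspired by the pattern great_expectations uses, but it does not use that library's API; the function names `expect_values_between` and `expect_matches_source` and the sample latitude values are hypothetical, chosen only for illustration.

```python
import math


def expect_values_between(values, low, high):
    """Check that every value lies in [low, high]; NaN counts as a failure."""
    unexpected = [v for v in values if not (low <= v <= high)]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


def expect_matches_source(source, migrated, abs_tol=0.0):
    """Check that a migrated array matches its source element by element."""
    if len(source) != len(migrated):
        return {"success": False, "reason": "length mismatch"}
    mismatches = [
        i
        for i, (a, b) in enumerate(zip(source, migrated))
        if not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
    ]
    return {"success": not mismatches, "mismatch_indices": mismatches}


# Hypothetical sample data: the last migrated value was corrupted.
source_lat = [-90.0, -45.0, 0.0, 45.0, 90.0]
migrated_lat = [-90.0, -45.0, 0.0, 45.0, 900.0]

# A check that applies to any dataset, new or migrated.
range_result = expect_values_between(migrated_lat, -90.0, 90.0)

# A check that only applies when the original data is still available.
match_result = expect_matches_source(source_lat, migrated_lat)
```

The two functions mirror the two cases mentioned above: range-style expectations work for brand-new datasets, while source comparison only makes sense for migrated ones. A real tool would need to operate lazily on chunked arrays rather than Python lists.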

Or continuing development on zarr pyramids, tiff support in xarray, or regridding?
But please loop me in. USGS might provide an “in kind” contribution.

Validation tools for cloud optimized datasets (xref Tool for validating geo data/services moved to the cloud? - #7 by rabernat, but including both new datasets and migrated datasets). In a discussion with @rabernat a couple years ago, he pointed me to great_expectations as an example of this being done well in other fields.

There’s also pandera which works with Dask dataframes
https://pandera.readthedocs.io/en/stable/dask.html?highlight=dask

I’d also like to propose better exploratory drill-down support for multi-dimensional datasets. We are starting this here for a project, but the project itself is winding down: Add support for popups on selection streams by philippjfr · Pull Request #6168 · holoviz/holoviews · GitHub

Here to upvote the validation tools. Some folks at ESDIS presented great-expectations in August 2023 as a potential tool for the DAACs to use, but I’m not sure what came of it. We also have various tools specific to the DAACs that would be great to compile and compare notes on.

We’re exploring the potential for a Zarr-focused proposal to F.7 Support for Open-Source Tools, Frameworks, and Libraries to help build out Zarr-Python 3, including a number of key extensions enabled by the v3 spec. Happy to discuss on a call.

Thanks for sharing your interest @thodson, @ahuang11, @briannapagan, and @jhamman! Let’s go ahead and set up a call to discuss these ideas, with a particular focus on the F.7 funding opportunity (LOI due on May 03, 2024).

Could you please fill out this when2meet? I’ll use it to pick a time between April 17 and April 22, then announce the time and share a Zoom link.

Hey @aterrel, I’m wondering if you or anyone from your team would be interested in participating in our discussion about these funding opportunities next week?

Thanks Max, this would be good for some people to look at.

Thanks for the responses, everyone!

Let’s plan to meet Wednesday, April 17 at 2 PM ET. Please reach out if you’d like a calendar invite; otherwise, here’s a Zoom link. Looking forward to the discussion!

Hi, I am Miguel and I am a member of OPeNDAP. We are interested in interacting and collaborating more with the Pangeo community, so we’d be happy to join the meeting too if you think that’s a good fit for a potential NASA-funded collaboration. We have the Hyrax server on the EarthData cloud as a result of a long-standing NASA contract, and we want to find ways to best serve the broad Pangeo community. We think the following are areas that NASA funding could help accomplish, but we are open to suggestions:

  1. Provide support for and further develop pydap (xarray backend). Pydap has long been neglected, and while it continues to be widely used (e.g. thousands of access requests to NASA EarthData on the cloud per month), there are many features it does not support (e.g. DAP4/groups) even though OPeNDAP servers have long supported them.
  2. Enable support for Zarr stores so that OPeNDAP servers can operate on them. This involves making our DMR++ libraries compatible with Blosc for chunk decompression. DMR++ libraries enable efficient requests for data in a POSIX filesystem or on S3 (via Hyrax).
  3. Enable some level of integration between the Hyrax server and the Kerchunk library so that Kerchunk can access data on the cloud via Hyrax (using the DMR++ documents). This will allow further integration between OPeNDAP and Pangeo tools, and potentially higher levels of parallelization.
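As a rough illustration of the byte-range idea behind item 3, the sketch below turns Kerchunk-style reference entries (`[url, offset, length]`) into HTTP `Range` headers. It is only a sketch under stated assumptions: the URL and chunk keys are made up, the helper names `range_header` and `plan_requests` are hypothetical, and nothing here parses a DMR++ document or calls Hyrax or Kerchunk.

```python
def range_header(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"


def plan_requests(refs):
    """Turn [url, offset, length] entries into (key, url, headers) tuples.

    Entries that are not three-element lists (for example inline metadata
    stored as strings) are skipped because they need no byte-range request.
    """
    plan = []
    for key, ref in refs.items():
        if isinstance(ref, list) and len(ref) == 3:
            url, offset, length = ref
            plan.append((key, url, {"Range": range_header(offset, length)}))
    return plan


# Hypothetical references; the URL and offsets are placeholders.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "temp/0.0": ["https://example.invalid/data.h5", 4096, 1024],
    "temp/0.1": ["https://example.invalid/data.h5", 5120, 2048],
}

plan = plan_requests(refs)
```

Because each chunk maps to an independent request, the resulting plan could be fetched in parallel, which is where the potential for higher levels of parallelization comes from.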

This seems potentially related to the suggestion in Using hidefix to determine byte ranges in HDF files? · Issue #38 · gauteh/hidefix · GitHub

Thanks for sharing, Miguel! Yes, we’d definitely welcome you to participate in the discussion on Wednesday about how all these ideas connect.

Interesting. It looks like the (experimental) Rust+Python hidefix is indeed inspired by the DMR++ module. Cool!