NASA has released a pretty amazing funding opportunity:
Open source software tools, libraries, and frameworks play an increasingly prominent
role in SMD-related science research and applications. As the adoption of open
software accelerates the rate of scientific discovery, the National Academies’ has
recognized the growing need among the NASA science community to provide sustained
support and maintenance of these types of software in their 2018 report Open Source
Software Policy Options for NASA Earth and Space Sciences. This program is designed
to provide support to the sustainable development of open source software, tools,
libraries, and framework that are critical for SMD scientific objectives.
SMD seeks proposals for the improvement and sustainment of high-value, open source
tools, frameworks, and libraries that have made significant impacts to the SMD science
community. We are seeking proposals that satisfy the following objectives:
• Open source software tools, libraries, and frameworks that have significant
usage in the NASA science community, developed following open and
collaborative practices, and are aligned with the scientific vision and data
strategic plan of SMD.
• Proposals should look to improve the sustainability and utility of these packages
through improvements to adding extensions, documentation, infrastructure, and
maintenance of the software.
This program seeks to support projects under active development and usage, and it
does not support updating of legacy software that is no longer supported, which can be
supported under other calls. We are not soliciting the development of new open source
tools, frameworks, or libraries with this call.
This could be a great opportunity to support core Pangeo tools such as xarray, dask, etc.
I am not personally planning to submit to this call. However, I would love to see coordination within the Pangeo community among those who are interested. Let’s use this thread to discuss.
I am very interested in collaborating on this.
Some xarray-centric ideas that came up in the icepyx thread (cc @JessicaS11):
- DatasetTree for “collections” of datasets (netCDF Groups, xarray issue)
- awkward xarrays
Both of these seem like nice general functionality that is specifically useful for satellite datasets (icepyx is great motivation!).
Perhaps someone more involved with NASA datasets can comment on the xarray/dask pain points, and then we can work out how to fix those. (cc @cgentemann)
PS: Another idea I had was to add “computational backends” so
da.mean("time") # runs numba-compiled mean
but this is quite generic…
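To make the "computational backends" idea a bit more concrete, here is a minimal sketch of what a backend registry could look like. Everything in it is hypothetical: `register`, `use_backend`, and `mean` are illustration names, not xarray API, and the "fsum" backend is only a stand-in for a compiled kernel such as a numba-jitted one.

```python
from contextlib import contextmanager
from math import fsum

# Hypothetical registry: backend name -> {reduction name -> implementation}.
_BACKENDS = {}
_active = "python"


def register(backend, name):
    """Register a function as the `name` reduction for `backend`."""
    def decorator(func):
        _BACKENDS.setdefault(backend, {})[name] = func
        return func
    return decorator


@register("python", "mean")
def _mean_python(values):
    return sum(values) / len(values)


@register("fsum", "mean")
def _mean_fsum(values):
    # Stand-in for a compiled kernel; a real backend might wrap numba here.
    return fsum(values) / len(values)


@contextmanager
def use_backend(name):
    """Temporarily route reductions to another backend."""
    global _active
    if name not in _BACKENDS:
        raise KeyError(f"unknown backend: {name}")
    previous, _active = _active, name
    try:
        yield
    finally:
        _active = previous


def mean(values):
    # The user-facing call stays the same; only the dispatch target changes.
    return _BACKENDS[_active]["mean"](values)
```

Usage would be something like `with use_backend("fsum"): mean(data)`, so user code calling `mean` does not change when the backend does.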
Great ideas Deepak!
Reading the call closely, it’s not clear that a proposal needs to propose a lot of ambitious new development. This is a huge step forward for a funding agency, since we have been asking for this for years! See e.g. this blog post (I was a co-PI on the project described there.)
So a “maintenance” proposal for Xarray and the surrounding ecosystem would also be compelling.
Reading the call closely, it’s not clear that a proposal needs to propose a lot of ambitious new development
Ya, just trying to generate ideas and cover all bases here
I remember @scottyhq, @dshean & colleagues had some pain points with rasterio/xarray/mfdataset etc.
@dcherian Thanks for linking me to this discussion. This sounds amazing. As some/all of you may know I work on the Satpy and pyresample libraries which deal with reading, manipulating, and writing satellite data. Here are some things that come to mind or have popped up recently for some of our users:
xarray/dask’s handling of open files, especially when these are streams coming from a remote file system. For example, a Satpy user just discovered that a CachedFileSystem from fsspec reading from remote S3 storage requires a lot of open file objects. He’d like to work with 1000s of files (e.g. ABI L1b files are 16 files per Scene, and he’s dealing with M1 and M2 files, which are 16 files every minute), so he quickly hits his ulimit and doesn’t have the permissions to edit it. The xarray open_dataset documentation mentions that autoclose doesn’t work with streams. If there is any way to improve the handling of this then that’d be amazing.
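One general way to stay under an open-file limit is to keep only a bounded number of handles open and close the least recently used one when the bound is exceeded. The sketch below shows that pattern with the standard library only. `HandleCache` is a hypothetical name, it works on local paths rather than fsspec streams, and it is not a description of how xarray or fsspec manage files internally.

```python
from collections import OrderedDict


class HandleCache:
    """Keep at most `maxsize` files open, closing the least recently used."""

    def __init__(self, maxsize=128, opener=open):
        self.maxsize = maxsize
        self._opener = opener          # swap in another opener for other storage
        self._handles = OrderedDict()  # path -> open handle, oldest first

    def get(self, path):
        handle = self._handles.pop(path, None)
        if handle is None:
            # Not cached (or evicted earlier): open it again from scratch.
            handle = self._opener(path, "rb")
        self._handles[path] = handle   # mark as most recently used
        while len(self._handles) > self.maxsize:
            _, oldest = self._handles.popitem(last=False)
            oldest.close()
        return handle

    def close(self):
        while self._handles:
            _, handle = self._handles.popitem()
            handle.close()

    def __len__(self):
        return len(self._handles)
```

A handle that was evicted and reopened starts again at offset 0, so callers should seek explicitly instead of relying on the previous read position.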
Writing geotiffs in parallel. I’m not sure where the rioxarray project is on this or the xarray community in general, but it has always been a pain point for me that I can’t create geotiffs with dask using a multiprocess/cluster scheduler. This may be possible now, but last I checked it wasn’t. This may also be a limitation of the geotiff format and the libraries that write them that make it just too difficult to perform these operations.
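For context, a common workaround for formats that cannot take concurrent writers is to compute tiles in parallel but serialise the writes behind a lock. The sketch below shows only that pattern, using a flat binary file as a stand-in for a GeoTIFF. `compute_tile`, `write_tiles`, and `TILE` are hypothetical names, and no GeoTIFF library is involved.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

TILE = 4  # bytes per tile (hypothetical, tiny for illustration)


def compute_tile(index):
    # Stand-in for an expensive per-tile computation.
    return bytes((index * TILE + i) % 256 for i in range(TILE))


def write_tiles(path, n_tiles, workers=4):
    lock = threading.Lock()
    with open(path, "wb") as f:
        f.truncate(n_tiles * TILE)  # preallocate the output file

        def job(index):
            data = compute_tile(index)  # computed concurrently
            with lock:                  # only one thread writes at a time
                f.seek(index * TILE)
                f.write(data)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces completion and re-raises any worker exception.
            list(pool.map(job, range(n_tiles)))
```

Because the lock and the file handle live in one process, this only helps with a threaded scheduler. It does not extend to a multiprocess or cluster scheduler, which is the pain point described above.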
I thought I had more, but it would seem my baby induced sleep deprivation has made me forget. I’ll try to come back to this if anything pops up.
I’m also really excited to see this proposal opportunity. It’s probably best I avoid direct involvement here but I’m happy to help support any initiative that moves forward for either Xarray, Dask, or Pangeo.
On the Xarray side, I think a strong argument could be made for a maintenance proposal alongside limited feature development. Building on our CZI grant for Xarray (https://doi.org/10.6084/m9.figshare.12709556.v1), we could pick a few Xarray features that would be particularly relevant to NASA applications (e.g. spatial indexes and cloud data access).
Another idea that may be interesting to explore is to target a community support role for one or more projects. This is similar to what @mrocklin is doing with Dask’s CZI award and what the Jupyter team has done with their Contributor in Residence efforts.
cc @benbovy, @wtbarnes, @lheagy
Unlike the CZI proposal though, this one is possibly 10x larger. I’m going to be thinking about how to handle this on the Dask side next week (I’m ostensibly on vacation this week). I’d be happy to blend with Xarray if devs there want to collaborate. I’m not yet sure how/where to base the application from. I would probably default to NumFOCUS, but would be happy to find another administrative home (Coiled can’t be prime here due to being for-profit).
I mentioned this on the Pangeo call today, but I think bolstering the capability of Xarray to work with COG and other raster data would be awesome. xr.to_rasterio would be nice, for starters. Check out the parallel-writing TileDB with Dask code that we need because we don’t have this!
I agree, I think this could be awesome! I would be particularly excited about funding support/maintenance for xarray and related Pangeo projects, rather than just new feature development.
My reading of the solicitation is that it’s $1.5M/year, intended to be split among 5-10 projects. That’s not too far off what CZI provides (up to 250k/year/project).
Given that this small amount of funding / projects will be split among all of NASA science, it seems wise to focus on the most general purpose, widely used tools.
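As a quick check of the per-project figure, here is the arithmetic implied by that reading of the solicitation ($1.5M/year split among 5-10 projects). The function name is just for illustration, and an even split is an assumption.

```python
def per_project_range(total_per_year, min_projects, max_projects):
    """Return (low, high) yearly funding per project for an even split."""
    low = total_per_year / max_projects   # most projects -> smallest share
    high = total_per_year / min_projects  # fewest projects -> largest share
    return low, high


low, high = per_project_range(1_500_000, 5, 10)
print(f"${low:,.0f} to ${high:,.0f} per project per year")
```

Under those assumptions the range brackets the CZI cap of 250k/year/project mentioned above.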
Great discussion, and I want to express my interest as well. As many of you know, I’ve been working to learn/use Xarray as a go-to tool and we have plans to use it for icepyx, but I’m constantly running up against many of the geospatial issues mentioned. More than once I’ve considered going back to my old gdal/ogr scripts, but haven’t simply because I’m too stubborn. I see this as a huge hurdle for transitioning the broader remote sensing user community to these powerful tools, demonstrating a critical need for the functionality discussed along with more examples and documentation.
Unfortunately I can’t take on leading a proposal for this call, but I’m eager to discuss ideas and help with putting one together.
Given the migration of the DAACs to AWS (247 PB - see this post), and the substantial egress costs, it seems like there could be a really strong case for improving the Pangeo - cloud-DAAC connection (data access/workflows, fast IO, and tools). Not sure if it belongs in this proposal, but it could be included as a minor or secondary component.
Dear all, I’m really happy to see the enthusiasm about this and I support the initiative. I’m happy to help with internal review if you think it is helpful.
My bandwidth does not allow me to follow everything that’s happening on the pangeo side at the moment, but I’d like to second @dcherian’s comment above: if we can assume that NASA is interested in leveraging their remote sensing products, convergence of the remote sensing community towards a limited number of tools (within pangeo?) would be a huge step forward. The xarray related pain points I always hear from the remote sensing community are the rasterio <-> xarray integration, CRS and grid/reprojection handling, and time-varying coordinates (e.g. swath data).
I am well aware that this is not the best thread to raise these points: where is the right place to discuss this atm? I remember the very active “grid” thread on github: did the discussion continue somewhere, are we beginning to see a convergence from the earth observation community towards specific xarray workflows?
Thanks all for the discussion and ideas! I’m also keen to collaborate on this, and can submit a proposal through University of Washington / the eScience Institute. I agree that a proposal that benefits all 4 NASA science branches (Earth Science, Planetary Science, Heliophysics, Astrophysics - https://science.nasa.gov/about-us/smd-vision) would be looked at favorably, but it might also be advantageous to focus on the Earth Science expertise of this community. I imagine there will be both broad and branch-specific proposals.
As a geoscientist, I also find myself wanting to hone in on improving integration of the Python geospatial stack (xarray, dask, geopandas, rioxarray, cartopy, geoviews… the list goes on). A lot of Pangeo’s development has been focused on pushing the cutting edge and scalability of these Python libraries, and that is awesome to take part in! However, to echo some points already made, I also agree that focusing on documentation of simple workflows and existing functionality will help drive open-source adoption.
We’ve run a number of science-focused hackweeks at University of Washington as part of NASA ACCESS2017-funded efforts, and many participants accustomed to a curated closed-source alternative (Matlab, ArcGIS) are eager to try something new but struggle with the mix-and-match world of Python. With that in mind, the community would benefit from more guides to help scientists transition from closed-source to open-source (e.g. ArcGIS -> QGIS/Python, Matlab -> Python). I’m thinking back to my own transition from Matlab to Python and finding this simple document extremely useful http://mathesaurus.sourceforge.net/matlab-numpy.html (it probably still is!).
I’m glad to see so much interest in this proposal in the Pangeo community. I agree with @scottyhq that it would make sense to focus on our Earth Science expertise, which could then be generalized to other branches encountering similar data science challenges. Overall I think it would be important to provide specific examples of how any developments feed into advancing NASA’s overall science missions, and it seems to me collectively we already have a lot of examples to build on.
I’m particularly interested in the community building and education component, and it is good to see NASA prioritizing that. In addition to the ICESat-2 library development led by @JessicaS11, we are also learning a lot about how to maintain community engagement after a sprint or workshop.
Looking forward to collaborating in some way!
The deadline to submit a notice of intent is later this week. I would encourage maintainers of other OSS libraries (Xarray, Satpy, Jupyter, etc.) to consider submitting a letter. This should be an easy task that keeps the door open for a larger future submission in January.
I’ve informed Andy from NumFOCUS about this thread in case OSS projects want to use NumFOCUS to manage the funds.
Thanks Matt! I haven’t tracked down all the requirements yet. Might be good to do a call to talk tactics early this week.
Okay, so it seems that the expected awards are 150-300K. We need a brief summary by Thursday to make the NOI deadline. It seems that if we miss that deadline we can just email Steve Crawford. Because of the budget, it seems that it would be best to group each code into separate proposals, but we really need to see who is going to be funded.
I’m not used to collaborating with this group, how should we proceed? I could start making google docs with steps, or we could get on a call and hash out a few statements?
@JessicaS11, @scottyhq & I are hashing out things for a UW/NCAR-led “xarray for raster satellite datasets” proposal offline. We will keep this group posted with updates as things progress.
Okay sounds like others are organizing the proposal writing. Just let me know if NumFOCUS can help in the future.