Looking for the best way to complement, rather than compete with, this community, but commercially

Hi all, I’m posting this here at the suggestion of @rabernat.

I’ve recently started a company aiming to serve communities adjacent to the Pangeo community on what I refer to as the “data needs continuum.”

I am far from an expert in climate science, but what I am an expert in is running reliable data engineering teams at internet scale. I’d like to apply what I know to the climate space because, like all of you, I want to contribute to the fight of our generation in a productive way. Possibly unlike some of you, I also intend to do it in an unapologetically commercial way.

I’m taking a big chance and trying out this discussion in public because I want to be a good citizen of this community and complement, rather than compete with, the awesome things you all are doing. Sorry this is long-winded; I promise I’ll get to an actual question soon enough.

I have a thesis that there are many people outside the climate science community who could make good use of data like the CMIP6 ScenarioMIP datasets in particular, but who cannot ascend the steep learning curve from knowing nothing but SQL (and maybe vanilla numpy) to using xarray, dask, ESGF, intake-esm, etc., let alone understanding the nuances and vocabulary of CMIP itself. I made this very learning journey myself this year and it was definitely challenging, even given my background in big data.

If you clicked through my link above, I’m talking about serving people in the “advanced business intelligence” and, secondarily, the “sophisticated almost-science” buckets. To save people in those buckets time, I’m making opinionated, pre-stitched, pre-computed subsets of ScenarioMIP data available on a certain cloud-native, SQL-oriented database that I’d rather not name publicly, yet (but will happily name privately). Before you ask, I have experience with PB-scale data on this platform so, yes, I do know it can handle it even though it is SQL-oriented.

By opinionated, I mean I’ll only make a relatively small subset of experiments, models and variables available. By pre-stitched, I mean I’ll be aligning historical, areacella, piControl and sspXXX, etc together so that doesn’t have to be done by hand. By pre-computed, I mean precalculating stuff like moving averages, or ensemble and cross-model means and confidence intervals, etc. In my ideal world, I also make it easy to do naive, “good enough” intersections of this data with non-climate, GIS-style geographic shapes, like US states or other arbitrary polygons and points, in order to make it easier to join to tabular business data like “my list of subscriber addresses” or “my warehouse locations” (no, I don’t mean actual downscaling, though that would be rad, too, if admittedly much, much more difficult).
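
To make that a bit more concrete, here’s a minimal sketch of what I mean by “pre-stitched” and “pre-computed,” using plain xarray. The file names, model, and variable choices below are placeholders for illustration, not our actual pipeline.

```python
import xarray as xr

# Hypothetical input files: one historical run and one SSP5-8.5 run of near-surface
# air temperature from the same model and member, plus the matching grid-cell areas.
hist = xr.open_dataset("tas_Amon_SOMEMODEL_historical_r1i1p1f1.nc")
ssp = xr.open_dataset("tas_Amon_SOMEMODEL_ssp585_r1i1p1f1.nc")
area = xr.open_dataset("areacella_fx_SOMEMODEL_historical_r1i1p1f1.nc")["areacella"]

# "Pre-stitched": align the historical and scenario runs into one continuous record.
tas = xr.concat([hist["tas"], ssp["tas"]], dim="time")

# "Pre-computed": an area-weighted global mean plus a 20-year rolling average --
# the kind of aggregate a SQL-first user would rather query than recompute.
global_mean = tas.weighted(area).mean(dim=("lat", "lon"))
rolling_20yr = global_mean.rolling(time=240, center=True).mean()  # 240 monthly steps
```

The actual service does this sort of thing at scale and lands the results in SQL-queryable tables, but the operations are the same in spirit.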

I call this “data engineering as a service.”

So, here’s the question.

What do you all, as a community, think some of the best docking-up points are between what I’m doing and what you’re doing? How do I best complement what you’re doing, as opposed to simply duplicating it in a different form? Also, are there any sensibilities in the community that I’m not catching on to and would be stomping all over, as a newcomer, without realizing it (aside from the obvious licensing and attribution, which I feel I have a respectful and thorough handle on)?

I realize those are pretty open-ended questions, so, my apologies. I’m not asking anyone to do my homework for me, for sure. But the last thing I want to do, in this fight of our generation, is slow us all down by introducing confusion and duplication in the ecosystem. We simply don’t have the time. Plus, it seems rude not to ask, given all the work you’ve put into this ecosystem.

PS: I’m also looking for climate scientist collaborators, so if any of this piques your interest, personally, please feel free to email me at trcull@pollen.io


Similar: Preprocessed CMIP data into timeseries.

But from what I read, you want to make gridded longitude/latitude data available.


I’ve actually been corresponding directly with Zeb, as luck would have it! But, yes, I actually want to keep the data at the same resolution, as opposed to summarizing it like he did in that paper. But he’s the one who pointed me to stitching piControl with the sspXXX as an ongoing challenge that would be good to solve (which he did in that paper, but had to do it by hand).


Another group that has been working in this space is Rhodium Group / Climate Impact Lab.

They were behind this very cool interactive graphic in the NY Times:

@dgergel in particular has been hard at work on the CMIP6 cloud data pipeline and perhaps has some relevant insights to offer.

It’s fantastic that you’re bringing the CMIP6 cloud data to the business intelligence community. That’s one of the main reasons we think climate data belong in the cloud in the first place! To try to answer your question specifically, I would say that our community is very interested in openness, computational transparency, and data provenance tracking.

If you’re going to be selling data products derived from CMIP6, I would argue it’s very important to have the data pipeline be as transparent and automated as possible, such that upstream errata from ESGF can propagate to your database in a timely manner. From our point of view, the best way to achieve trustworthiness would be via open source, specifically, open-sourcing the code and pipeline that you are using to produce these datasets, and opening up this project to community input via GitHub.

Oh yes, for sure, I’m carrying the data provenance and licensing of the original data all the way through to the end product, down to the very URL we pulled the NetCDF files from and the full text of the license metadata field in the original NetCDFs. And we’ll be monitoring ESGF for any changes at least daily. Though, as Zeb’s paper linked above illustrates quite well, that starts to look really ridiculous once you’re aggregating a lot of data into more highly summarized datasets, so I honestly doubt very many people will be looking at it in detail, in reality. But it is part of the original CMIP Creative Commons licensing which we’re absolutely honoring, so it will be there, nonetheless.
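
Just to illustrate the sort of thing I mean (a rough sketch, not our actual schema; the source URL below is a made-up placeholder), the relevant NetCDF attributes simply ride along on anything derived from them:

```python
import xarray as xr

src = xr.open_dataset("tas_Amon_SOMEMODEL_ssp585_r1i1p1f1.nc")  # placeholder file

# A derived product, e.g. annual means.
annual = src["tas"].resample(time="YS").mean().to_dataset(name="tas_annual")

# Carry the CMIP-required provenance forward verbatim. The source URL here stands
# in for wherever the original NetCDF was actually pulled from.
annual.attrs["source_url"] = "https://some-esgf-node.example/original_file.nc"
for key in ("license", "tracking_id", "mip_era", "source_id", "experiment_id"):
    if key in src.attrs:
        annual.attrs[key] = src.attrs[key]
```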

For open-sourcing the code, that feels important for the parts that do material calculations on top of the data (of which there aren’t any yet, really), so the community can verify the calculations are correct. But I honestly doubt much else would be useful, because it’s all infrastructure code that’s quite specific to our own infrastructure. The real value-add is in expending some compute and storage to roll stuff up and make it available in a way that removes friction and creates a shallower learning curve, not so much in the code itself.

Certainly, to the extent that we’re finding we need to alter other open source projects like, say, xarray, we’d definitely contribute that back.


Hi Tim,

This idea is very similar to the reason I started my weather/climate data service (https://oikolab.com), but with ERA5 & GFS data. A common frustration is that historical weather/climate data are very time-consuming to find; on the other hand, most depictions of ‘how much the world has warmed up’ show time series at the global or regional scale, which makes them very difficult to relate to. The key idea is to fetch time-series data for each location via a REST API, so that analyses like these can be performed with ease:

Much of my back-end ETL process overlaps with the Pangeo stack, and I’m eternally grateful that packages such as Zarr and Xarray were available to let me do this as someone who started with only a superficial software and climate background. The service is not as mature as I would like it to be yet, but we have a small but growing number of very happy users.

Happy to share my experience if you’d like.

  • Joseph

@rabernat, thanks for the tag.

Hi Tim,

Thanks for posting here to get community input - I appreciate that you don’t want to introduce duplication into the existing ecosystem (far too many don’t recognize that this is an issue!).

I have a few comments, as a climate scientist doing climate science research in the private sector while also working with lots of folx in academia through the Climate Impact Lab.

Firstly, I think it’s great that you’ve recognized the access problem with climate data. CMIP6 is really challenging for people outside climate research to use, for a host of reasons, from not understanding the language of the experiments to accessing model output from ESGF. Like Ryan mentioned, I’ve been working on making CMIP6 data more accessible via what we’re calling the CMIP6-in-the-cloud pipeline. But that still requires knowing what experiments, models, scenarios, etc. you want to use and selecting those.

While I think you’ve identified a key need in the business community, one major issue is that a lack of familiarity with CMIP6, and climate science and climate change research more broadly, leads people to make suboptimal choices in deciding what climate data to use. More and more companies in the private sector are hiring climate scientists and building out climate risk teams to deal with this problem, but it still remains a big one. I’m guessing you’ve already come across a recent article, Fiedler et al 2021, that addresses this issue, but if you haven’t, I highly recommend it - it gets at the heart of some of the challenges of making climate data available without having an interlocutor to help non-climate folx understand what data to use and how to use it.

Secondly, while you mention that you plan to make “material calculations on top of the data” open source, you also discuss how you may not make the infrastructure code open source because it’s too specific to your infrastructure. I would have a word of caution here - I think this involves a series of decisions that should be open source even if they seem hyperspecific to your particular architecture. Pipelines involve decisions that lead to datasets, and I think we’re in an era now where pipelines should also be open source. This imperative is even stronger for something like climate data that will have potentially dramatic implications in the business and financial communities for decisionmaking around climate risk.

Again, I’m really glad you posted here, and I’ll most likely follow up with you via email. I’m curious to hear more about what you’re building out.


Regarding closed source vs open source, I’m happy to continue that conversation over email as much as you like but (perhaps ironically) it’s not necessarily one I’m comfortable having over the open internet. But to close it out for those watching at home, I’d say there are definitely proven business models built around open source, most notably the “open source core + closed source convenience layer” model with Cloudera as an example. And there are several efforts already underway even in the Pangeo-adjacent space in the “turnkey, deployed framework” model for, say, hosted Dask clusters. It should be noted, however, that even with those two paradigms, there’s still quite a bit of closed source code in play, which is basically what I’m referring to when I say “stuff that is specific to our infrastructure.” In any case, where we change and leverage what the community has provided, we intend to release it back, like any good open source citizen should.

That Fiedler study is super interesting, thanks for the referral! I know what I’ll be reading tonight.

In any case, I wasn’t so much looking to explore business models as to explore the climate data problem space itself. As in: if you all are familiar with particular problems that people are already solving well and easily with the Pangeo stack, then that’s an area we wouldn’t want to wade into and just add confusion. And, conversely, if you’re aware of particular problems that people are trying to bludgeon Pangeo into doing even though it’s not a natural fit, then that could be interesting for us. What would be especially interesting is any really common data pre-processing that basically everyone does and that’s painful. As a relatively trivial example, if you wanted to take a slice of CanESM5 ssp585 from 2020 through 2050 for only +/- 15 degrees of latitude around the equator and then do something with it, you’d have to download all the data into a cluster first, only to slice away most of it once it’s there. But extracting that from our service directly into an xarray Dataset for further analysis in Pangeo is, potentially, trivially easy. As is doing other relatively straightforward calculations over the data, in situ, before downloading the result into xarray and doing more complicated calculations. That kind of thing.
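
For context, that example today looks roughly like the sketch below, using the public Pangeo CMIP6 catalog (the catalog URL and facet names are my best understanding, so treat the details as illustrative rather than authoritative):

```python
import intake

# The public Pangeo CMIP6 catalog on Google Cloud.
col = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# Monthly near-surface air temperature from one CanESM5 SSP5-8.5 member.
cat = col.search(
    source_id="CanESM5",
    experiment_id="ssp585",
    table_id="Amon",
    variable_id="tas",
    member_id="r1i1p1f1",
)
dsets = cat.to_dataset_dict(zarr_kwargs={"consolidated": True})
ds = list(dsets.values())[0]

# The slice itself is one line, but because each Zarr chunk spans the whole globe,
# even this tropics-only subset still reads full-globe chunks for 2020-2050.
tropics = ds["tas"].sel(time=slice("2020", "2050"), lat=slice(-15, 15))
```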

As the world continues to wake up to the need to consider climate in basically everything, eventually there won’t be enough of y’all climate scientists to go around for everyone to have a team of their own. So it will be ever more important to spare you the humdrum data engineering and keep you focused on your real value-add instead. Or, better yet, to answer some of the easier questions entirely without you, so you can stick to answering the hard ones.

If anyone has flashes of inspiration along those lines (the “data engineering as a service” I was referring to) then I’m happy to figure out the most collaborative way to dock them up.

This is an interesting, concrete example of the kind of convenience you are trying to provide through your platform. I think it’s worth exploring the best way to achieve this goal. Your assertion is that we should generate yet another copy of the CMIP6 data in an even more analysis-ready format, because the existing Zarr cloud data is not convenient enough. This community has already worked quite hard to bring CMIP6 to the cloud in Zarr, precisely with the goal of providing a more user-friendly analysis experience! So your comments suggest that that effort has, so far, not been a total success; the data remain inaccessible and hard to work with.

There are a couple of different ways we could make data access more convenient while working within the open-source / open-data framework to improve how the data are stored and accessed on the cloud. With regard to your specific example, the data are difficult to slice in space (within +/- 15 deg) because of chunking: the CMIP6 Zarr data are generally chunked in time and contiguous in space. As you correctly state, this means you generally have to download the whole dataset if you want a timeseries at just a single point. But there are several specific ways we could improve this situation:

  • Recent technical advances in Zarr, specifically, partial chunk reads, could overcome this limitation.
  • Using Caterva within Zarr could have a similar impact.
  • We could try TileDB instead of Zarr, which is supposed to handle this situation better.
  • We could use much smaller chunks, to facilitate easier slicing. This is feasible because of async in zarr, another recent feature enhancement that came out of this community.
  • (Related) We could generate a rechunked copy of some datasets to support different modes of analysis. Rechunking data has been a major topic of discussion in this forum (see the sketch after this list).

All of these are feasible if there is engineering effort devoted to them.
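
To make the rechunking option concrete, here is roughly what producing a timeseries-optimized copy could look like with the rechunker package that came out of this community. The store paths and chunk sizes below are purely illustrative; the right targets depend on the access pattern.

```python
import xarray as xr
from rechunker import rechunk

# An existing CMIP6 Zarr store, chunked in time and contiguous in space.
# (Placeholder path -- point this at a real store from the Pangeo catalog.)
ds = xr.open_zarr("gs://cmip6/some-model-tas-store", consolidated=True)

# Target layout: contiguous in time, small tiles in space, so that a point or
# small-region timeseries touches only a handful of chunks.
target_chunks = {
    "tas": {"time": ds.sizes["time"], "lat": 10, "lon": 10},
    "time": None,  # leave coordinate chunking alone
    "lat": None,
    "lon": None,
}

plan = rechunk(
    ds,
    target_chunks,
    max_mem="2GB",
    target_store="gs://my-bucket/tas-timeseries.zarr",
    temp_store="gs://my-bucket/tas-rechunk-tmp.zarr",
)
plan.execute()  # runs as a Dask computation
```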

As for improving search and discoverability of the data, that is a major emphasis of the CMIP6 cloud / ESGF group, with many different options being considered (STAC, ElasticSearch, etc.). Creating a more user-friendly front-end would be a great service to the community.

Alternatively, one could imagine ingesting the data into a proprietary data platform, where this convenience is provided by a black box, for a fee. Perhaps it’s obvious at this point which route I favor for my own efforts. I blogged about it here:

What would be great would be to identify areas where contributing to the open-source foundations (Xarray, Zarr, STAC, etc.) of this CMIP6-in-the-cloud effort would be mutually beneficial to both your efforts and to the broader science community that laid those foundations. I tried to enumerate some very specific ideas above. To circle back to the original topic, devoting some of your company’s resources to the open-source stack and open data archive would be a great way to “complement, rather than compete with, this community”. :grinning_face_with_smiling_eyes:

As a concrete step forward, we could start exploring whether any of the ideas I enumerated above could help lower the friction for timeseries analysis of the “relatively small subset of experiments, models and variables” that you’re interested in.