Hi all, I’m posting this here at the suggestion of @rabernat.
I’ve recently started a company aiming to serve communities adjacent to the Pangeo community on what I refer to as the “data needs continuum.”
I am far from an expert in climate science, but what I am an expert in is running reliable data engineering teams at internet scale. I’d like to apply what I know to the climate space because, like all of you, I want to contribute to the fight of our generation in a productive way. Possibly unlike some of you, I also intend to do it in an unapologetically commercial way.
I’m taking a big chance and trying out this discussion in public because I want to be a good citizen of this community and complement, rather than compete with, the awesome things you all are doing. Sorry this is long-winded; I promise I’ll get to an actual question soon enough.
I have a thesis that there are many people outside the climate science community who could make great use of data like the CMIP6 ScenarioMIP datasets in particular, but who cannot ascend the steep learning curve from knowing nothing but SQL (and maybe vanilla numpy) to using xarray, dask, ESGF, intake-esm, etc., let alone understanding the nuances and vocabulary of CMIP itself. I made this very learning journey myself this year, and it was definitely challenging, even given my background in big data.
If you clicked through my link above, I’m talking about serving people in the “advanced business intelligence” and, secondarily, the “sophisticated almost-science” buckets. To save people in those buckets time, I’m making opinionated, pre-stitched, pre-computed subsets of ScenarioMIP data available on a certain cloud-native, SQL-oriented database that I’d rather not name publicly yet (but will happily name privately). Before you ask, I have experience with PB-scale data on this platform, so, yes, I do know it can handle it even though it is SQL-oriented.
By opinionated, I mean I’ll only make a relatively small subset of experiments, models, and variables available. By pre-stitched, I mean I’ll be aligning the historical, piControl, and sspXXX runs (along with fixed fields like areacella) so that doesn’t have to be done by hand. By pre-computed, I mean precalculating things like moving averages, ensemble and cross-model means, confidence intervals, and so on. In my ideal world, I’d also make it easy to do naive, “good enough” intersections of this data with non-climate, GIS-style geographic shapes, like US states or other arbitrary polygons and points, to make it easier to join to tabular business data like “my list of subscriber addresses” or “my warehouse locations” (no, I don’t mean actual downscaling, though that would be rad, too, if admittedly much, much more difficult).
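To make “pre-stitched” and “pre-computed” concrete, here’s a rough sketch of the kind of pipeline I mean, using the public Pangeo Google Cloud CMIP6 catalog. The model, member, and variable below are just illustrative placeholders, not product decisions, and I’m glossing over calendar and grid details a real pipeline would have to handle:

```python
# Rough sketch: stitch one model's historical + ssp585 runs of "tas"
# (near-surface air temperature) and pre-compute an area-weighted global
# annual mean. Assumes intake-esm, xarray, zarr, and gcsfs are installed.
import intake
import xarray as xr

cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

subset = cat.search(
    source_id="CESM2",        # illustrative model choice
    member_id="r1i1p1f1",
    table_id="Amon",
    variable_id="tas",
    experiment_id=["historical", "ssp585"],
)
dsets = subset.to_dataset_dict(zarr_kwargs={"consolidated": True})

# Keys sort so the CMIP (historical) entry comes before ScenarioMIP
# (ssp585); "stitching" is then just a concat along time.
hist, ssp = (dsets[k].squeeze() for k in sorted(dsets))
stitched = xr.concat([hist, ssp], dim="time")

# Fetch the matching grid-cell areas (areacella) for spatial weighting,
# assuming the model publishes them as a fixed field.
area_dsets = cat.search(
    source_id="CESM2", variable_id="areacella", experiment_id="historical"
).to_dataset_dict(zarr_kwargs={"consolidated": True})
areacella = next(iter(area_dsets.values()))["areacella"].squeeze()

# One example of a "pre-computed" product: area-weighted global-mean
# annual temperature across the whole stitched record.
global_annual_tas = (
    stitched["tas"]
    .weighted(areacella.fillna(0))
    .mean(["lat", "lon"])
    .groupby("time.year")
    .mean()
)
```

And for the naive geographic joins, the cheapest version is a vectorized nearest-grid-cell lookup (the polygon case, like averaging over US states, is the same idea with region masks instead of points). The CSV and its column names here are hypothetical:

```python
import pandas as pd

# Hypothetical tabular business data: one row per site, with lon/lat columns.
locations = pd.read_csv("warehouse_locations.csv")

# CMIP longitudes run 0-360, so shift western-hemisphere points accordingly,
# then pull each site's nearest grid cell from the stitched dataset above.
site_tas = stitched["tas"].sel(
    lon=xr.DataArray(locations["lon"].values % 360, dims="site"),
    lat=xr.DataArray(locations["lat"].values, dims="site"),
    method="nearest",
)
```

That second step is obviously not downscaling, just a lookup, which is exactly the “good enough” tier I’m aiming at.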
I call this “data engineering as a service.”
So, here’s the question.
What do you all, as a community, think some of the best docking-up points are between what I’m doing and what you’re doing? How do I best complement what you’re doing, as opposed to simply duplicating it in a different form? Also, are there any sensibilities in the community I’m not catching on to that I’d be stomping all over, as a newcomer, without realizing it (aside from the obvious licensing and attribution, which I feel I have a respectful and thorough handle on)?
I realize those are pretty open-ended questions, so my apologies. I’m not asking anyone to do my homework for me, for sure. But the last thing I want to do, in this fight of our generation, is slow us all down by introducing confusion and duplication into the ecosystem. We simply don’t have the time. Plus, it seems rude not to ask, given all the work you’ve put into this ecosystem.
PS: I’m also looking for climate scientist collaborators, so if any of this piques your interest, personally, please feel free to email me at trcull@pollen.io