Potential for adapting Pythia Foundations for different disciplines (e.g., neuro)

Hi all. Thanks for being so welcoming at the community meeting yesterday! It was really fun to attend and see how similar the issues are to the projects I have worked on.

I’ve got a lot of thoughts and I tend to write too much (sorry). So I’m going to break this up into sections.

Why I came here
As I said at the meeting, I have been thinking about the best way to help users of some of the neuro-specific packages get on board with Python and the data science ecosystem. I run workshops on the package that I’ve helped maintain, and this has been the biggest request: help me learn Python/the basic data science stack.

Your presentation at SciPy last year was great because Pythia Foundations seems like a natural fit (I have really struggled with how to help users learn matplotlib, numpy, and especially pandas).

I got here partly because I previously tried to roll my own Python class (GitHub - EricThomson/practical_python: Learn just enough Python to start analyzing and visualizing data.) based mostly on other people’s resources, but it was largely a failure for a host of reasons I’d be happy to share. :smiley:

Pythia Foundations for neuro?
I have been thinking ever since the SciPy presentation that it would be great to reshape Pythia Foundations for neuroscientists. I have mostly thought about what I would want to cut (sorry :smiling_face_with_three_hearts:) – cartopy, datetime, xarray (which is great but not that big in neuro yet). Neuro would probably also want to revamp the data formats section, as ours are idiosyncratic. I’d probably want to add scikit-learn and maybe scikit-image.

I’d also want to add more on image loading/viewing, especially image stacks (tiff is a terrible but ubiquitous format in neuro), and on tools for viewing them (napari, fastplotlib, or others).
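For example, here is the kind of minimal snippet I’d want such a section to build toward (just a sketch; it assumes a hypothetical multi-page tiff at movie.tif, with tifffile and napari installed):

```python
import tifffile
import napari

# Load a (frames, height, width) stack from a multi-page tiff (hypothetical file)
stack = tifffile.imread("movie.tif")
print(stack.shape, stack.dtype)

# Browse it interactively; napari adds a slider over the leading (frame) axis
viewer = napari.view_image(stack, name="calcium movie")
napari.run()
```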

At any rate, I haven’t thought through the specifics of what it would look like too much, for a couple of reasons. First, I mainly wanted to get your thoughts (and I appreciate the openness to reuse of the material – obviously if I were ever to adapt it, it would be with grateful and transparent attribution to the source). Second, I wouldn’t want to do this alone: I’d want to talk to other neuroscientists first and try to build some support, because working in isolation is exactly the pitfall I want to avoid.

Pythia Foundations: generic version?
I’ve also thought about other options, like could there be a “generic” Foundations template that didn’t have anything specific to any science, but would be generically useful to RSE types trying to make their way in the Python ecosystem? This could either just be used straight-up, or be adapted by specific groups.

Lately I have also been thinking a lot about how to onboard people into data science/ML/AI more generally, especially people underrepresented in tech and the “data” sciences.

I am from Durham, NC, which is very diverse, but walking into work, or going to tech meetings, SciPy, etc., is like walking through a diversity-flattening filter. This really bothers me, and I’ve been thinking a lot about what we might do about it as a community and how we might maximize the reach of great educational materials. In that spirit, I wonder if we could work on a minimal kernel of the Pythia Foundations repo that specific scientific disciplines – public health, bioinformatics, physics, etc. – could then take and adapt for their particular needs.

E.g., most disciplines don’t need cartopy, or a heavy emphasis on datetime, or (arguably) xarray (this hurts me to say b/c I love xarray+dask).

I see the core trio as numpy, matplotlib, and pandas (and, as you note in the repo, we all realize scipy is important, but I think we just learn what we need by osmosis, which is how Pythia handles it, and that seems right).

I might pitch adding scikit-learn to that triad (though maybe I’m just biased because I’m from neuroscience, so feel free to shoot me down here :blush:, but … :question:).

tl;dr
Project Pythia is amazing. I’d love to see it adapted for other disciplines, but maybe the right way to do that is to create a core generic version with the geo bits cut out so that individual disciplines can modularize and add in what they need where they need it.

Please take this all in the spirit of brainstorming and trying to figure out the best way for others to take advantage of the great work you are doing, and use your work as a model. I’d be very curious to hear what you think!


This sounds like an awesome idea. You might also be interested in how the ClimateMatch academy was modelled after NeuroMatch.

We would love to talk to you about how to increase adoption of xarray+dask in the neuroscience community. Xarray tries hard to be domain-agnostic, positioning itself as a labelled wrapper for gridded datasets. I see it as being at a similar level of generality to Pandas, which is a domain-agnostic labelled wrapper for tabular datasets. (But as an xarray dev I’m completely biased :sweat_smile:)

In fact, we have funding from CZI to hire a full-time person to work on xarray specifically to improve integration with bio-science use cases (and we want to hire someone with a bio background), so now would be a great time to push this.


Tom, that sounds like a good plan. I have scratched my head about why dask/xarray have not caught on more in neuro. One notable exception is the Minian package for calcium imaging analysis, which was really well designed: GitHub - denisecailab/minian: miniscope analysis pipeline with interactive visualizations

I have honestly considered helping to support that project, partly because the main dev who built it (and did a great job) has graduated, so it doesn’t seem to be getting much support currently.

I think what neuro needs is outreach, education, and simple tutorials on xarray, because we are just starting to appreciate the need for out-of-core computation that the geosciences have been handling forever. A library I worked on a lot used memmapping, but that is basically a relic – memmapping is an OS-specific black box that nobody understands. xarray is much more transparent, cloud-friendly, etc. (as you know).
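To make that concrete, the pattern I have in mind is roughly this (an untested sketch; the file layout is made up, and it assumes dask-image is installed):

```python
import xarray as xr
from dask_image.imread import imread

# Lazily read a movie spread across many tiff files; nothing is loaded into memory yet
frames = imread("session01/frame_*.tif")  # returns a chunked dask array

# Wrap it with labelled dimensions so downstream code is self-documenting
movie = xr.DataArray(frames, dims=("frame", "y", "x"), name="fluorescence")

# Out-of-core reduction: evaluated chunk by chunk, never materializing the whole movie
mean_image = movie.mean(dim="frame").compute()
```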

Plus, with things like Cubed (which I see you are a contributor to!), we can have bounded computations which is a big issue with some of our ridiculously long movies where we do things like nonnegative matrix factorization.
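Just to sketch what I mean by bounded computation (untested, loosely based on my reading of the Cubed quickstart; the shapes, and keywords like allowed_mem, are assumptions on my part and may differ by version):

```python
import cubed
import cubed.array_api as xp

# Every operation is planned so that no single task exceeds this memory budget
spec = cubed.Spec(work_dir="tmp", allowed_mem="2GB")

# Stand-in for a long movie (frames, y, x); real data would come from a zarr store
movie = xp.ones((50_000, 512, 512), chunks=(100, 512, 512), spec=spec)

# A simple bounded-memory reduction; an NMF pipeline would chain many such steps
mean_frame = xp.mean(movie, axis=0)
result = mean_frame.compute()
```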

Building out a proof of concept with Minian and Cubed, using some really large movies and showing how it works from laptop to cloud, would be a pretty amazing use case, I think. :slightly_smiling_face:

This is where the field needs to go IMO, I’d definitely be interested in chatting more.


That’s a really cool package that I had not seen before!

I think what neuro needs is outreach, education, and simple tutorials on xarray

There is no reason that some of these examples and tutorials could not live in the xarray documentation itself. On a call last week we were planning our xarray tutorial at SciPy this year, for which we maintain the xarray-tutorial repository, and we agreed that what we really want are example non-geoscience datasets that can be opened as xr.Dataset objects. Could we work together on adding a small example neuroscience dataset to GitHub - pydata/xarray-data: Data repository for xarray examples (so that it could be opened with xr.tutorial.open_dataset('neuro_something'))?
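Something like this (purely hypothetical, since 'neuro_something' doesn’t exist yet; the variable and dimension names are just a guess at what a small physiology example might look like):

```python
import xarray as xr

# Hypothetical: would fetch a small sample file from pydata/xarray-data on first use
ds = xr.tutorial.open_dataset("neuro_something")

# With labelled dims like (trial, channel, time), selections read naturally;
# "voltage" and "CA1" are made-up names purely for illustration
trial_mean = ds["voltage"].sel(channel="CA1").mean(dim="trial")
```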

(cc’ing @scottyhq @JessicaS11 )

Plus, with things like Cubed (which I see you are a contributor to!), we can have bounded computations which is a big issue with some of our ridiculously long movies where we do things like nonnegative matrix factorization

We were just saying yesterday that Cubed needs some specific real-world use cases / workflows to really put it through its paces! A workload requiring bounded memory use would be a great example to work on. (cc @tomwhite @rabernat )

Building out a proof of concept with Minian and Cubed, using some really large movies and showing how it works from laptop to cloud, would be a pretty amazing use case, I think. :slightly_smiling_face:

That sounds incredible. It could hit three birds with one stone (be useful to neuro people, show xarray being domain-agnostic, and road-test Cubed).

cc’ing @jhamman too for bio-related interest


This sounds amazing!! :fire:

I would definitely be interested in working on this and learning. If there is interest, I’d be happy to reach out to Denise Cai, who is the PI behind minian. She is an awesome person and would probably be happy to get some help with her package since her grad student graduated. :smile: (edit: though I don’t want to assume anything: PIs are busy people :grinning:).

I’ve got a meeting now, but count me in: this sounds fantastic, I’d learn a lot in the process, and you all are doing great stuff!


Hi Eric,

Thanks for bringing this up! I’ve made some small efforts to try to generalize Pangeo (which I would argue is the computational underpinning behind much of Pythia) as Pandata, covering all the domain-general aspects of this set of tools and way of working. I’ve got a paper about it at GitHub - panstacks/pandata: The Pandata scalable open-source analysis stack , but beyond that we haven’t gotten very far in defining and characterizing Pandata.

My HoloViz group at Anaconda happens to have four neuroscience-related PhDs working in it, and we also have a CZI grant to apply these tools to working with neuroscience data, focusing mainly on the visualization aspects and including some work with Minian. Our efforts so far are at GitHub - holoviz-topics/neuro: HoloViz+Bokeh for Neuroscience, but the work is still underway and thus very much in flux. We’re not calling that work “Panneuro”, but I guess we could. :slight_smile: Sounds like we should probably bring you in as a collaborator, in any case! If so, please reach out via DM with your contact details.


Thanks for that – I’ve seen the pandata page, and didn’t realize there was an accompanying paper. I will check it out – that’s absolutely in the spirit of what I am talking about.

I think the hard part, pedagogically, is figuring out what really needs to be taught. I think we all agree on matplotlib, numpy, and pandas (with scipy in for free, and probably jupyter nowadays). Beyond that, things start to fragment. E.g., I would push for scikit-learn, but I may be wrong since I’m very biased toward ML. Some would push for xarray, but that fills a fairly niche, big-data market.

One option is to separate things into a core and periphery. Let people know these (core things) are the things you should definitely learn. These (periphery) are the things you can pick and choose as modules.

And we’d just have to agree on the core :upside_down_face:

So, something like a spider web, with certain bits at the center and others depending on your domain-specific needs (this is from the class I created that didn’t work out very well).

On the panneuro front, @arokem has some really cool thoughts, specifically following what pangeo has done: PanNeuro

I met him once and should probably credit him with inspiring this thread. This is basically my attempt to do for Pythia Foundations what he was pushing for with Pangeo. :grinning:

[Edit added] PS @jbednar I don’t have DM privileges here yet, but I can be reached at thomson dot eric at gmail.


Thanks for the pointer to Ariel Rokem’s slides; it definitely seems relevant!

When trying to conceptualize Pandata, I too started with a core + periphery model, but it really didn’t work – different disciplines make fundamentally different choices on data structures, as you are noticing, with xarray being absolutely essential in climate science, and nearly completely unknown in finance and other fields. NumPy is in the core, sure, but most people don’t often use NumPy directly; they use (and should use) a higher-level data library than that, but they just don’t agree on whether that’s Pandas, or Xarray, or any of the others listed in my figure above. And they certainly don’t agree on file formats. That’s why my figure ended up organized as a checklist (choose your data storage | choose how to access it | choose your data structure or data API (pandas, xarray, etc) | choose your add-on libraries | choose your viz tools | choose your UI). The figure could certainly use work to make the “pick one or more from each column” aspect clearer.

What I think makes Pandata/Pangeo/Panneuro etc. work as a concept is that Jupyter, Dask, Panel, Matplotlib, and hvPlot all work well with all those Data APIs, which lets you choose whichever one is right for your field, and still get Pangeo-level performance and power if you do it right (which is what Pandata is about). A core+periphery model would be really easy to explain, but I really don’t think it’s right for the way these tools work, because people do not (and in my opinion, cannot and should not) agree on the absolutely most core item of all, i.e. how end users will access your data.


James, great points, and I appreciate that the center-surround structure ultimately stretches to a breaking point. There are two things here that I should be careful not to conflate.

One, the Pandata concept, which is a useful conceptual schema for thinking about Python’s data analysis ecosystem. Two, how do we on-ramp newcomers into this ecosystem, and which parts of it should we actually teach them? This second, didactic concern is my main question in this thread, and what I was grappling with in my original post in its different iterations. So, for instance, as a practical matter I wouldn’t include duckdb or datashader in an on-ramp like Pythia Foundations.

I like that Pythia Foundations is very minimalist: numpy, matplotlib, pandas, and a few geo-specific things. That’s where I’m asking questions: what would we add for neuro, and what would we cut if we made a generic, science-agnostic version?

The side quest discussed with @TomNicholas above sounds very cool as well, but I see that as a sort of independent third thing. :smile:


Sounds great! Let me know if you need any help with this from the Cubed side.

Thanks,
Tom


Yes, that’s a useful distinction. I’ve already tried to order the figure so that the less-common choices like DuckDB and GraphBLAS are closer to the bottom of the page, so that people will find something they know earlier in the list. Any particular domain (e.g. Panneuro) would focus on a subset of each column, not the whole thing.

That said, being able to handle very large datasets is central to both Pangeo and Pythia. From pangeo.io: “Pangeo is first and foremost a community promoting open, reproducible, and scalable science.” From projectpythia.org: “Project Pythia is the education working group for Pangeo… Together these initiatives are helping geoscientists make sense of huge volumes of numerical scientific data.” So I’d argue that the ability to visualize very large data is not an optional add-on to these projects, and e.g. Matplotlib+Datashader provides that while Matplotlib alone does not. Pythia currently puts Matplotlib in the Foundations and Datashader in the Cookbooks, and I think that’s fine as long as it’s clear to users who do have large data what they should do to make sense of it.
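For concreteness, here is roughly what the Datashader side of that looks like (a minimal sketch with synthetic data; the column names are arbitrary):

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Synthetic stand-in for a point dataset far too large to scatter-plot naively
n = 10_000_000
df = pd.DataFrame({
    "x": np.random.normal(size=n),
    "y": np.random.normal(size=n),
})

# Aggregate the points onto a fixed-size raster, then shade the counts into an image;
# the result can be displayed on its own or embedded in a Matplotlib axes
canvas = ds.Canvas(plot_width=600, plot_height=600)
agg = canvas.points(df, "x", "y")
img = tf.shade(agg, how="log")
```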


Thanks for tagging me here! I wanted to mention a couple of things that will be interesting for folks on this thread. The first is that my 2019 BRAIN Initiative slides eventually evolved into a white paper that puts a bit more meat on the bones of that idea. I’ll admit that I have not had much opportunity to continue exploring these ideas, except in partnering with 2i2c to deliver some of them as part of a summer school that I organize (https://neurohackademy.org/). However, two other groups (that I know of) have made substantial progress on this:

  1. DANDI: The BRAIN Initiative archive for data from a variety of physiology modalities has done a large amount of work in developing the standards and infrastructure underlying a DANDIhub, which implements many of the things mentioned in the white paper.

  2. Loren Frank’s lab at UCSF has done a lot of work on data management and provenance tracking. They also collaborate with 2i2c on this. See this HHMI blog post.


Ariel,
Thanks a lot for the updates! Neurohackademy looks great; I’ve been watching from afar with admiration for some time. :slightly_smiling_face: Last year I went to the ODIN conference, which had really good representation from NWB+DANDI. I’m really hoping they continue to grow!

I hadn’t seen the Spyglass output from Frank’s lab; that seems very cool! :exploding_head:

These are the kinds of great specialist tools I’m talking about (DeepLabCut, suite2p, nitime (:smile: ), and many more), but they require a working baseline Python skill set that people often don’t have (a sizable percentage of the people I talk to are still using Matlab). I think Pythia Foundations, shaped for neuro, could be a great didactic resource.

I have run workshops on calcium imaging analysis, and it would be nice to be able to send people to something like a Pythia Foundations_neuro and say, “Please do this; it will give you what you need before the workshop.” Instead we send out long emails with links to different resources for each library. I’d like to stop doing that, and as you realized long ago, the geosciences are ahead of the game here.