Data models for Pangeo

Hey @rabernat! (and any other respondents welcome)

Recall that one of the ten items in Pangeo’s teaching wheelhouse is presenting the xarray data model, which IIRC you connected to the Common Data Model (which appears to be a Microsoft contribution, described here). Furthermore, there are the xarray data structures documentation and the NetCDF data model.

My question, with an eye toward teaching pangeistic skills: what is the relationship between these (and other) perspectives? Or, if you like, just sketch an optimized learning path as you see it. I would be tempted to take the road from float to list to dictionary to ndarray to pandas DataFrame to xarray DataArray to Dataset and onward from there, but that is a very ground-up approach, so I think it would be pleasant to have a grand scheme in mind to go with it.
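For concreteness, here is a minimal sketch of that ground-up ladder, one rung at a time; it assumes nothing beyond numpy, pandas, and xarray, and all values are toy data:

```python
import numpy as np
import pandas as pd
import xarray as xr

# One measurement: a float
t0 = 11.3

# Many measurements: a list, then a labeled dictionary
temps = [11.3, 11.1, 10.9]
record = {"temperature": temps}

# A NumPy ndarray adds fast, homogeneous, N-dimensional storage
arr = np.array(temps)

# A pandas DataFrame adds labeled rows and columns (tabular data)
df = pd.DataFrame({"temperature": arr},
                  index=pd.date_range("2021-01-01", periods=3, freq="h"))

# An xarray DataArray adds named dimensions and coordinates (N-D)
da = xr.DataArray(arr, dims=["time"],
                  coords={"time": df.index}, name="temperature")

# An xarray Dataset bundles multiple DataArrays that share dimensions
ds = xr.Dataset({"temperature": da,
                 "salinity": ("time", [33.1, 33.0, 33.2])})
print(ds)
```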

Also, as a practical matter, we have this interesting idea of writing pangeistic code, like “no for-loops”: a worthy goal.
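As a hedged illustration of what “no for-loops” means in practice (the numbers here are invented):

```python
import numpy as np
import xarray as xr

# Hypothetical hourly temperature series (made-up values)
temps = xr.DataArray(
    np.random.default_rng(0).normal(10, 2, 1000), dims="time"
)

# Loop style (un-pangeistic): iterate element by element
mean = float(temps.mean())
anomalies_loop = [t - mean for t in temps.values]

# Vectorized style: one labeled-array expression, no loop
anomalies = temps - temps.mean()
```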

Suppose I have sensor time-series data streams in a Dataset (I am thinking of in situ profilers from the OOI Regional Cabled Array, for example). How does the new person learn and internalize skills with .sel and .where and so on? That is, how do they translate their data-focused questions into Pangeo Python? I will elaborate on this in the POETs repo in further detail; I just call it out here to expand the surface area of the problem.
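Here is a sketch of how such questions might translate, assuming a hypothetical profiler-style Dataset with a temperature variable on (time, depth); none of these names come from the real OOI data:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical profiler-style Dataset: temperature on (time, depth)
time = pd.date_range("2021-06-01", periods=48, freq="h")
depth = np.arange(0, 200, 10)
temp = xr.DataArray(
    12 + np.random.default_rng(1).normal(0, 1, (len(time), len(depth))),
    dims=("time", "depth"),
    coords={"time": time, "depth": depth},
)
ds = xr.Dataset({"temperature": temp})

# "What happened on June 1st?" : label-based selection
day1 = ds.sel(time="2021-06-01")

# "What was it like near 50 m?" : nearest-neighbor selection
near_50m = ds.sel(depth=50, method="nearest")

# "Show me only the warm water" : masking that preserves shape
warm = ds.where(ds.temperature > 12)
```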

I’m probably jumping into the middle of this conversation… but here are a few thoughts from my experience.

  1. CDM - My take is that this is more of a philosophy, i.e. communities should establish and use a CDM appropriate to their needs, and wherever possible they should build on established standards. In oceanography, the NODC NetCDF templates are a good place to start: put your data in this format, and it can easily be integrated into commonly used tools like Panoply and ERDDAP.

  2. There are really two parts to this: getting your data into the right structure (e.g. timeseries, profiles, 4-D grids) and including appropriate, required, and standardized metadata, e.g. following the CF Conventions (see the sketch after this list).

  3. As for dimensions and data types (a.k.a. objects in Python), I think I would try to distinguish between the two…
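Expanding on item 2 before continuing: a minimal sketch of attaching CF-style metadata to an xarray variable. The standard_name and units values come from the CF standard name table; the overall structure is illustrative, not a complete NODC template:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"sea_water_temperature": ("time", np.array([11.3, 11.1, 10.9]))},
    coords={"time": np.array(["2021-01-01", "2021-01-02", "2021-01-03"],
                             dtype="datetime64[ns]")},
)

# CF-style variable metadata: standard_name and units are drawn from
# the CF standard name table; long_name is free text
ds["sea_water_temperature"].attrs.update(
    standard_name="sea_water_temperature",
    units="degree_C",
    long_name="Sea water temperature",
)

# Global attribute declaring which convention the file follows
ds.attrs["Conventions"] = "CF-1.8"
```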

Coming from a Matlab world, things were a bit easier. First there are dimensions: scalar (a single value), vector (a column of values), and array (2-D, 3-D… N-D). Then there are data types: e.g. integers, floats, and strings (plus many more annoying ones, like char and datetime). Structured arrays are great too… and very Python-dictionary-like.

Now in the Python world we have objects galore, and it’s often a challenge to mentally switch between them (lists and dictionaries being a good example).
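A tiny sketch of that mental switch, with invented values:

```python
# A list is ordered and indexed by position
depths = [0, 10, 20, 30]
print(depths[1])                   # 10

# A dictionary is indexed by key, much like a Matlab struct
profile = {"depth": depths,
           "temperature": [12.1, 11.8, 11.2, 10.7]}
print(profile["temperature"][1])   # 11.8
```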

In my training sessions, I’ve generally glossed over most of this and stuck to explaining the differences between, and relative advantages of, pandas DataFrames (basically Excel tables: great for lots of individual measurements with 1…n variables/columns) and xarray Datasets (which are great for multi-dimensional datasets, which of course are our bread and butter in geoscience).
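A minimal side-by-side of the two structures, using invented numbers:

```python
import numpy as np
import pandas as pd
import xarray as xr

# DataFrame: a flat, Excel-like table; one row per observation,
# 1..n columns of variables
df = pd.DataFrame(
    {"temperature": [11.3, 11.1, 10.9], "salinity": [33.1, 33.0, 33.2]},
    index=pd.date_range("2021-01-01", periods=3, freq="h", name="time"),
)

# Dataset: variables defined over named, shared dimensions; here a
# 2-D (time, depth) grid that a flat table cannot express as naturally
ds = xr.Dataset(
    {"temperature": (("time", "depth"),
                     np.random.default_rng(2).normal(11, 1, (3, 4)))},
    coords={"time": df.index, "depth": [0, 10, 20, 30]},
)
```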

In many cases, DataFrames and Datasets are interchangeable. The OOI dataset is a good example: most instruments provide simple timeseries with multiple variables. So which type you use really depends on the additional features of the library you wish to use (well, that and performance). But if you have a more complicated dataset, like a profiler dataset with time and depth dimensions, or optical data with a wavelength dimension, then xarray becomes essential.
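A sketch of that interchangeability, assuming a simple multi-variable timeseries like the one above:

```python
import pandas as pd

# A simple multi-variable timeseries, as a DataFrame
df = pd.DataFrame(
    {"temperature": [11.3, 11.1, 10.9], "salinity": [33.1, 33.0, 33.2]},
    index=pd.date_range("2021-01-01", periods=3, freq="h", name="time"),
)

# DataFrame -> Dataset: the index becomes a dimension coordinate
ds = df.to_xarray()

# Dataset -> DataFrame: the round trip flattens named dimensions back
# into an index (a MultiIndex once you have both time and depth)
df_again = ds.to_dataframe()
```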

@robfatland - As I understand it, there is no relationship between the Microsoft Common Data Model and the Unidata Common Data Model beyond the name. The Unidata Common Data Model, the NetCDF data model, and the xarray data model are all analogous. tl;dr: I think we can set the Microsoft one aside for the purposes of this conversation.

@seagrinch @jhamman Aye, thanks. I’m creating a POETs README to start this topic, reflecting your initial input.