Wednesday February 1st: Xarray-Datatree: Hierarchical Data Structures for Multi-Model Science

Pangeo Showcase Talk by Tom Nicholas at Columbia University’s Lamont-Doherty Earth Observatory

Bio
Tom works at Columbia University’s Lamont-Doherty Earth Observatory, doing physical oceanography research and a lot of open-source software development. He primarily works on xarray, which he originally got involved with in 2019 during his PhD in plasma physics. Having moved from one field to another, he is particularly interested in showing how xarray and other parts of the pangeo ecosystem can be used in various fields of science.

Abstract
Real scientific workflows often require working with many heterogeneous but related datasets. Examples in geoscience include: (1) scenario simulations by many different climate models in the same intercomparison project, (2) simulation data at multiple resolutions from a convergence scan or sub-grid-scale study, and (3) observational and simulation data of the same region. There is a need for a general high-level data structure that can organize such data in an accessible way while remaining flexible enough to adapt to the user’s mental model of their data. It should be intuitive, so that simple operations such as calculating average climatologies remain simple to express. It should also serialize to a commonly used data format, so as not to create backwards-compatibility problems.

The new xarray-datatree [1] package solves these problems by providing a tree-like hierarchical data structure that is general enough to be useful in a wide variety of cases. Datatree extends xarray by generalizing xarray.Dataset, so it builds upon an interface that many geoscientists are already familiar with. Analysis operations can be mapped over a whole tree, allowing simple operations to be expressed intuitively, even over complex heterogeneous datasets.

Datatree is inspired by netCDF. Xarray’s highest-level object is currently an xarray.Dataset, which stores a collection of arrays with a shared coordinate system and corresponds to a single group in a netCDF file. A DataTree object is instead a structured hierarchical collection of Datasets, and maps to multiple netCDF groups. Serialization to and from netCDF files is therefore possible with datatree, so backwards compatibility is maintained.

We will explain the model of datatree, its relation to netCDF and Zarr, and how to use the data structure to simplify your own work. We will also give examples of using datatree with real geoscience datasets, such as CMIP6 model data [2].
[1] GitHub - xarray-contrib/datatree: WIP implementation of a tree-like hierarchical data structure for xarray.
[2] Easy IPCC Part 1: Multi-Model Datatree | by Tom Nicholas | pangeo | Medium
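
To make the idea of mapping an operation over a whole tree concrete, here is a minimal sketch using only the Python standard library. It is not the xarray-datatree API: `ToyNode`, `map_over_tree`, `walk`, and `time_mean` are hypothetical names invented for illustration, a plain dict of lists stands in for an xarray.Dataset, and the numbers are arbitrary placeholders rather than real model output.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, Iterator, List, Tuple

# Toy stand-in for a dataset: variable name -> list of values along "time".
ToyDataset = Dict[str, List[float]]


@dataclass
class ToyNode:
    """One node of the tree: some data plus named child nodes (hypothetical class)."""

    data: ToyDataset = field(default_factory=dict)
    children: Dict[str, "ToyNode"] = field(default_factory=dict)

    def walk(self, path: str = "/") -> Iterator[Tuple[str, "ToyNode"]]:
        """Yield (path, node) pairs for this node and all of its descendants."""
        yield path, self
        for name, child in self.children.items():
            yield from child.walk(path.rstrip("/") + "/" + name)

    def map_over_tree(self, func: Callable[[ToyDataset], ToyDataset]) -> "ToyNode":
        """Apply func to every node's data, returning a new tree of the same shape."""
        return ToyNode(
            data=func(self.data),
            children={
                name: child.map_over_tree(func)
                for name, child in self.children.items()
            },
        )


def time_mean(ds: ToyDataset) -> ToyDataset:
    """Reduce each variable to its mean, a crude stand-in for a climatology."""
    return {name: [mean(values)] for name, values in ds.items()}


# Heterogeneous but related data: one model at a single resolution,
# another model at two resolutions with different numbers of time steps.
tree = ToyNode(children={
    "model_a": ToyNode(data={"tas": [14.0, 15.0, 16.0]}),
    "model_b": ToyNode(children={
        "coarse": ToyNode(data={"tas": [13.0, 15.0]}),
        "fine": ToyNode(data={"tas": [13.5, 14.5, 15.5, 16.5]}),
    }),
})

# One call expresses the operation for every dataset in the hierarchy.
climatology = tree.map_over_tree(time_mean)
for path, node in climatology.walk():
    print(path, node.data)
```

The design point this is meant to illustrate is that the operation is written once, against a single dataset, and the tree takes care of applying it at every node while preserving the hierarchy. The real package works with xarray.Dataset objects and netCDF groups; see [1] for its actual interface.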
