Xarray for raster data (DEMs) with inconsistent spatial extent

I’ve been spinning up a new project and have been trying to use xarray, but I’m remembering why I abandoned it years ago the first time I tried it. It’s not intuitive to teach yourself (e.g. a DataArray automatically has a variable included that isn’t listed in the output, but a Dataset lists its variables by name, even if there’s only one), and the examples/resources are thin to nonexistent if your data don’t fit into nice, neat, consistent grids (whether that’s because they have different spatial resolutions or different extents; projections aside here!). I’ve found a few docs/examples that mention reprojection (including rioxarray), but none actually provide much information on what’s happening or how.

Ultimately, I’m hoping to read in and stack a series of DEMs (digital elevation models). Each one has an extent that is, a priori, unknown to me (so the file or its metadata would have to be read/created to determine it), and though they are all 2 m resolution, the pixel centerpoints don’t necessarily match. When I bring the geotiffs into xarray (xarray.open_rasterio('path/to/DEM')), they have x, y, and band coordinates/dimensions. Ultimately, I was hoping to have an xarray object with time, x, and y dimensions and then a few data variables (for now just elevation, ultimately a few calculation outputs based on those elevations). The clearest workflow I can envision (sketched after the list below) is to bring each geotiff in, manually add the time coordinate/dimension (and value), remove the band coordinate/dimension, and add it to a framework dataset I’ve set up. The outstanding issues are:

  • This feels highly inefficient, to have to modify each dataarray so I can combine them
  • If I understand xarray correctly, I should be able to have x and y as coordinates/dimensions, but each time would have a unique set of x and y values, so x and y would depend on time. I’d have to impose this structure on each geotiff I import in order to combine multiple dataarrays
  • In order to have a single set of x and y values (rather than one for each time step), I will need to resample both my x and y grids AND the raster itself to match the unified grid. Will xarray or rioxarray handle this, and how do I tell it what to do (e.g. how I want the data resampled)?
  • What functions should I be using for these operations to make sure that I’m not making a meaningless dataset?
  • Should I even be using xarray in this way? It’d be easy enough to bring in each raster by itself, do some computations, and get the results. I’d ultimately like to have a stacked/compiled dataset, but this could be done independently from the processed DEMs later.
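
For concreteness, here’s roughly the per-file boilerplate I’m imagining (just a sketch; the file names, timestamps, and helper name are placeholders, and I’m not sure this is the idiomatic way):

import pandas as pd
import xarray as xr

def load_dem(path, timestamp):
    # open the geotiff: dims come in as (band, y, x)
    da = xr.open_rasterio(path)
    # drop the single-band dimension and add a time dimension/coordinate
    da = da.squeeze('band', drop=True)
    da = da.expand_dims(dtime=[pd.Timestamp(timestamp)])
    return da.rename('elevation')

# each DEM has its own x/y grid
dems = [load_dem('dem_morning.tif', '2012-08-16T10:33'),
        load_dem('dem_evening.tif', '2012-08-16T22:33')]

# this only lines up cleanly if the x/y coordinates actually match;
# otherwise join='outer' pads with NaN rather than resampling
ds = xr.concat(dems, dim='dtime', join='outer').to_dataset()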

I reached out to @dshean and @shashank on Slack, and they suggested that this isn’t the first time they’ve had this conversation. However, I’ve found little public discussion about it and what might be happening at the community/xarray level towards solutions. I’m hoping that the Pangeo community can shed some insight, because I’d really like to be able to leverage xarray!


Hi @JessicaS11, I encourage you to stick with it! It is generally tricky going from a set of 2D images to a 3D cube, especially if the 2D images start out in different projections or on misaligned grids. Fundamentally, xarray relies on rasterio/GDAL to open the myriad geospatial raster formats that exist out there, and these underlying libraries are really designed for 2D rather than 3D or ND data. Here is a relatively short and recent issue on the xarray issue tracker that provides some context: https://github.com/pydata/xarray/issues/4142.

That said, this example does cover some of what you’re trying to do: https://gallery.pangeo.io/repos/pangeo-data/landsat-8-tutorial-gallery/landsat8.html. See cells 20-21 in particular. Perhaps that code could be incorporated into an xarray function so that it’s easier for a user. Importantly, though, this is a specific case that works because all the 2D images have the same projection and transform (but not the same size), so xarray can easily align the grids without reprojection and interpolation.
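
Schematically, that’s the situation where you can just let xarray line the grids up for you; something like the following (not the tutorial’s code, and the DataArray names here are made up):

import xarray as xr

# two scenes sharing a projection and pixel alignment, but not an extent
da_a, da_b = xr.align(da_a, da_b, join='outer')  # pads the smaller grid with NaN
stack = xr.concat([da_a, da_b], dim='time')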

If you need to reproject, I often find myself 1) first working with the GDAL command line tools, where you have finer-grained control over 2D image warping (https://gdal.org/programs/gdalwarp.html), and then 2) building a multi-band VRT dataset with https://gdal.org/programs/gdalbuildvrt.html. You can then use xr.open_rasterio('stack.vrt') to get a multidimensional dataset for further post-processing and analysis. This could also likely now be accomplished with rioxarray, which in my mind “is to xarray as geopandas is to pandas”. So for operations requiring knowledge of the coordinate reference system (CRS), definitely give it a try!
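
As a rough sketch of that two-step route (untested; the file names and target grid/CRS here are made up, so check the gdalwarp/gdalbuildvrt docs for the exact options you need):

import glob
import subprocess
import xarray as xr

# 1) warp every DEM onto a common grid/CRS (2 m pixels, bilinear resampling)
for src in glob.glob('dems/*.tif'):
    dst = src.replace('.tif', '_warped.tif')
    subprocess.run(['gdalwarp', '-t_srs', 'EPSG:3413', '-tr', '2', '2',
                    '-tap', '-r', 'bilinear', src, dst], check=True)

# 2) stack the warped rasters into a multi-band virtual dataset
subprocess.run(['gdalbuildvrt', '-separate', 'stack.vrt',
                *sorted(glob.glob('dems/*_warped.tif'))], check=True)

# 3) open the stack; each band then corresponds to one acquisition time
stack = xr.open_rasterio('stack.vrt')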

Agreed that some additional simple and well-documented examples of these workflows would go a long way.


Thanks to @rabernat for some great help on XArray basics during yesterday’s coffee time! I’ve run into a few more stumbling blocks though. In a nutshell, I’m using .groupby and .apply to run a function on my dataset. However, I cannot get the results to be added to the dataset. The function outputs are arrays of varying lengths that I want to put into the dataset as a variable that depends only on the dimensional coordinate over which I am doing the groupby. I’ve created a mwe notebook (rendered) that’s on GitHub to try and capture the various approaches I’ve tried. I’d be grateful for anyone who can explain the gap between my logic/expectation and communicating that effectively to xarray!


This seems like good progress.

Based on examining your notebook, it seems like the problem is that your functions are not consuming / producing Datasets / DataArrays. That’s what groupby.apply(fn) expects the fn to do.

Your original function was this

def fn(x,y,elev,z):
    misc_arr = np.array([x,y/2,elev*z])
    return misc_arr

This has two problems:

  • It has many inputs, which it expects to be numpy arrays or individual values
  • It does not return an Xarray object

It looks like you’re trying to apply multiple different functions at the same time. Let’s just try one at a time first.

z = 7

def fn(ds):
    # some random function I just made up
    a = z * ds.elev**2 / ds.y
    return a

ds.groupby('t').apply(fn)

This works fine.

If we want to apply multiple functions, we could just call groupby several times with different functions. Or we could return a Dataset.

import numpy as np
import xarray as xr

z = 7

def fn(ds):
    # to merge DataArrays into a Dataset, each DataArray needs a name
    a = (z * ds.elev**2 / ds.y).rename('a')
    b = (ds.x * np.cos(ds.elev)).rename('b')
    return xr.merge([a, b])

ds.groupby('t').apply(fn)

A pattern to avoid is calling .values on xarray objects within groupby operations. This discards all the labels and metadata, which Xarray will need in order to put your dataset back together (the “combine” part of the split-apply-combine workflow.)

Also keep in mind that arithmetic / numpy operations on xarray objects usually “just work”, so you usually can just write numpy-style code in such functions (as in my examples above).

UPDATED: I’ve continued digging and experimenting. I think (among other issues) my dummy example functions muddled the issue. I’ll still need to address the .values issues mentioned by @rabernat, but one thing at a time. The first and most pressing question is whether or not I’m trying to add a variable in a way that xarray can handle. If I have a dataset:

<xarray.Dataset>
Dimensions:    (dtime: 2, x: 459, y: 363)
Coordinates:
  * x          (x) float64 -2.495e+05 -2.494e+05 ... -2.266e+05 -2.266e+05
  * y          (y) float64 -2.269e+06 -2.269e+06 ... -2.251e+06 -2.251e+06
  * dtime      (dtime) datetime64[ns] 2012-08-16T10:33:00 2012-08-16T22:33:00
Data variables:
    elevation  (dtime, y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
Attributes:
    transform:      (50.0, 0.0, -248400.0, 0.0, -50.0, -2253200.0)
    crs:            +init=epsg:3413

I’d like to get to:

<xarray.Dataset>
Dimensions:    (dtime: 2, x: 459, y: 363)
Coordinates:
  * x          (x) float64 -2.495e+05 -2.494e+05 ... -2.266e+05 -2.266e+05
  * y          (y) float64 -2.269e+06 -2.269e+06 ... -2.251e+06 -2.251e+06
  * dtime      (dtime) datetime64[ns] 2012-08-16T10:33:00 2012-08-16T22:33:00
Data variables:
    elevation  (dtime, y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
    new_var    (dtime) object array-of-arrays-here***
Attributes:
    transform:      (50.0, 0.0, -248400.0, 0.0, -50.0, -2253200.0)
    crs:            +init=epsg:3413

***each dataset.new_var entry would be of the form [array([[a,b],[c,d],[e,f],[a,b]]), array([[z,y],[x,w],[v,u],[t,s],[r,p],[z,y]]), array([...]), array([...])].

Back to my mwe, I’ve created a more appropriate dummy function output that looks like the actual output I’d expect from my calculations: misc_arr = [np.array([[1,2],[3,4],[5,6],[1,2]]), np.array([[9,8],[7,6],[5,4],[3,2],[1,0],[9,8]])]

I would like for this ENTIRE list of arrays, with all of its subarrays, to be the value of ds.new_var.isel(t=0). I’m starting to think this isn’t possible (xarray can’t handle the variable lengths of the values within new_var unless there is also another dim that can be a meaningless index the length of the longest entry within ds.new_var, which is not known a priori).

To summarize: how can I iterate over a dimension and add the results as a variable that depends ONLY on that dimension? Is there an alternative function that behaves similarly to groupby and will let me iterate over dimension t while also adding a new variable value along that same dimension? Obviously, trying to do this using groupby returns a ValueError (dimension ‘t’ already exists as a scalar variable). The two potential solutions I’ve thought of so far are:

  1. create a dummy dimension t_fake and then reset the dimensions on the new variables when I’m done (assuming that’s a straightforward task)
  2. Manually iterate over the time dimension (this seems to be exactly the opposite of the point of xarray)

Yes this is right. You might look at awkward array.

Can you post an example? This should work with groupby IIUC.


Thanks for the feedback, @dcherian! I discovered the squeeze flag for groupby (which defaults to True, making the dimension over which you’ve done the groupby disappear), which helped solve the dimension issue.
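
In other words (simplified from my real code, so just a sketch with a made-up dimension name):

# with the default squeeze=True, each group arrives without the 't' dimension;
# squeeze=False keeps it, so results can be attached back along 't'
for label, group in ds.groupby('t', squeeze=False):
    print(label, group.dims)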

I’m excited to look into awkward array more, because on first glance it looks like it’s beginning to address some of the issues I often have working with vector data. Do you have any sense of how it distinguishes itself from Pandas in this regard?

I was finally able to find a solution by converting each individual array of arrays into a pandas Series with dtype=object, so I was then able to add my results one variable-length array at a time along my time dimension. Thanks @jbusecke for this suggestion!
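
For anyone who finds this later, the gist of what worked was roughly the following (simplified; 'dtime' and the dummy arrays stand in for my real dataset and results):

import numpy as np
import pandas as pd

# one variable-length array of results per time step
results = [np.array([[1, 2], [3, 4], [5, 6], [1, 2]]),
           np.array([[9, 8], [7, 6], [5, 4], [3, 2], [1, 0], [9, 8]])]

# an object-dtype Series keeps each ragged array intact as a single value,
# so it can be assigned along the existing time dimension
ds['new_var'] = ('dtime', pd.Series(results, dtype=object))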

@rabernat, I’ve been taking some notes and ultimately hope to put together some info on my journey into xarray and working through some of these geospatial and basic xarray-understanding issues. Do you have a place/format you’d recommend?


Not sure about awkward vs. pandas; pandas is still a “neat table” in some way, so maybe it won’t work for more awkward datasets?

awkward is NEP-18 compatible, so in theory Xarray could wrap it (https://github.com/pydata/xarray/issues/4285). In practice, it probably doesn’t work yet :slight_smile:


Follow-up on this topic at Why does this library exist? · jpivarski/ragged · Discussion #6 · GitHub

To me, the OP sounds like a groupby (time) over the inputs and then a call to the warper for all sources in a time step; the warper API resolves all input details to a target grid (with resampling options).

I agree that rioxarray seems ill-suited to this; I prefer to use osgeo.gdal. The resulting grids (materialized or virtual) could then fit into the xarray framework.
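
Something along these lines with the Python bindings (a sketch; the file names and target grid/CRS are invented, and gdal.Warp accepts the grid and resampling options directly):

from osgeo import gdal
import xarray as xr

sources = ['dem_a.tif', 'dem_b.tif']  # all DEMs belonging to one time step

# warp (and mosaic) every source onto one target grid in a single call
gdal.Warp('timestep_0.tif', sources, format='GTiff',
          dstSRS='EPSG:3413', xRes=2, yRes=2,
          resampleAlg='bilinear', targetAlignedPixels=True)

# the resulting grid then drops straight into xarray
da = xr.open_rasterio('timestep_0.tif')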

Got an example? Even a tiny set of files would be fine, or if it’s of interest I could construct one.