Xarray for raster data (DEMs) with inconsistent spatial extent

JessicaS11 · August 31, 2020, 8:42pm

I’ve been spinning up a new project and have been trying to use xarray, but I’m remembering why I abandoned it years ago the first time I tried to use it. It’s not intuitive to teach yourself (e.g. a DataArray automatically has a variable included that’s not listed in the output, but a DataSet lists the variables by name, even if there’s only one), and the examples/resources are thin to nonexistent if your data don’t fit into nice, neat consistent grids (whether that’s because they have different spatial resolutions or different extents, projections aside here!). I’ve found a few docs/examples that mention reprojection (including rioxarray), but none actually provide much information on what’s happening/how.

Ultimately, I’m hoping to read in and stack a series of DEMs (digital elevation models). Each one has, a priori, an unknown extent (by me - so the file or metadata would have to be read/created to determine it), and though they are all 2 m resolution, the pixel centerpoints don’t necessarily match. When I bring the geotiffs into xarray (xarray.open_rasterio('path/to/DEM')), they have x, y and band coordinates/dimensions. Ultimately, I was hoping to have an xarray with time, x, and y dimensions and then a few data variables (for now just elevation, ultimately a few calculation outputs based on those elevations). The clearest workflow I can envision is to bring each geotiff in, manually add the time coordinate/dimension (and value), remove the band coordinate/dimension, and add it to a framework dataset I’ve set up. The outstanding issues are:

This feels highly inefficient, to have to modify each dataarray so I can combine them
If I understand xarray correctly, I should be able to have x and y as coordinates/dimensions, but each time would have a unique set of x and y values, so x and y would depend on time. I’d have to impose this structure on each geotiff I import in order to combine multiple dataarrays
In order to have a single set of x and y values (rather than one for each time step), I will need to resample both my x and y grids AND the raster itself to match the unified grid. Will xarray or rioxarray handle this, and how do I tell it what to do (e.g. how I want the data resampled)?
What functions should I be using for these operations to make sure that I’m not making a meaningless dataset?
Should I even be using xarray in this way? It’d be easy enough to bring in each raster by itself, do some computations, and get the results. I’d ultimately like to have a stacked/compiled dataset, but this could be done independently from the processed DEMs later.

I reached out to @dshean and @shashank on Slack, and they suggested that this isn’t the first time they’ve had this conversation. However, I’ve found little public discussion about it and what might be happening at the community/xarray level towards solutions. I’m hoping that the Pangeo community can shed some insight, because I’d really like to be able to leverage xarray!

scottyhq · August 31, 2020, 11:28pm

Hi @JessicaS11, I encourage you to stick with it! it is generally tricky going from a set of 2D images to a 3D cube, especially if the 2D images start out in different projections or mis-aligned grids. Fundamentally, xarray is relying on rasterio/gdal for easily opening all the myriad geospatial raster formats that exist out there, and these underlying libraries are really designed for 2D rather than 3D or ND data. Here is a relatively short and recent issue on the xarray issue tracker that provides some context https://github.com/pydata/xarray/issues/4142.

That said, this example does cover some of what you’re trying to do https://gallery.pangeo.io/repos/pangeo-data/landsat-8-tutorial-gallery/landsat8.html. See cells 20-21 in particular. Perhaps that code could be incorporated into an xarray function so that it’s easier for a user. Importantly though, this is a specific case that works because all the 2D images have the same projection and transform (but not the same size), so xarray can easily align the grids without reprojection and interpolation.

If you need to reproject, I often find myself 1) first working with GDAL command line tools where you have finer-grained control over 2D image warping (https://gdal.org/programs/gdalwarp.html) and then 2) build a multi-band VRT dataset with https://gdal.org/programs/gdalbuildvrt.html. You can then use xr.open_rasterio('stack.vrt') to get a multidimensional dataset for further post-processing and analysis. This could also likely now be accomplished with rioxarray, which in my mind “is to xarray as geopandas is to pandas”. So for operations requiring knowledge of the coordinate reference system (CRS), definitely give it a try!

Agreed that some additional simple and well-documented examples of these workflows would go a long ways.

JessicaS11 · September 18, 2020, 3:43pm

Thanks to @rabernat for some great help on XArray basics during yesterday’s coffee time! I’ve run into a few more stumbling blocks though. In a nutshell, I’m using .groupby and .apply to run a function on my dataset. However, I cannot get the results to be added to the dataset. The function outputs are arrays of varying lengths that I want to put into the dataset as a variable that depends only on the dimensional coordinate over which I am doing the groupby. I’ve created a mwe notebook (rendered) that’s on GitHub to try and capture the various approaches I’ve tried. I’d be grateful for anyone who can explain the gap between my logic/expectation and communicating that effectively to xarray!

rabernat · September 21, 2020, 1:27am

This seems like good progress.

Based on examining your notebook, it seems like the problem is that your functions are not consuming / producing Datasets / DataArrays. That’s what groupby.apply(fn) expects the fn to do.

Your original function was this

def fn(x,y,elev,z):
    misc_arr = np.array([x,y/2,elev*z])
    return misc_arr

This has two problems:

It has many inputs, which it expects to be numpy arrays or individual values
It does not return an Xarray object

It looks like you’re trying to apply multiple different functions at the same time. Let’s just try one at a time first.

z = 7
def fn(ds):
    # some random function I just made uup
    a = z * ds.elev**2 / ds.y
ds.groupby('t').apply(fn)

This works fine.

If we want to do multiple functions, we could just call groupby several times, with different functions. Or we could return a DataSet.

z = 7
def fn(ds):
    # to merge DataArrays into a Dataset, each DataArray needs a name
    a = (z * ds.elev**2 / ds.y).rename('a')
    b = (ds.x * np.cos(ds.elev)).rename('b')
    return xr.merge([a, b])

ds.groupby('t').apply(fn)

A pattern to avoid is calling .values on xarray objects within groupby operations. This discards all the labels and metadata, which Xarray will need in order to put your dataset back together (the “combine” part of the split-apply-combine workflow.)

Also keep in mind that arithmetic / numpy operations on xarray objects usually “just work”, so you usually can just write numpy-style code in such functions (as in my examples above).

JessicaS11 · September 23, 2020, 6:22pm

UPDATED: I’ve continued digging and experimenting. I think (among other issues) my dummy example functions muddled the issue. I’ll still need to address the .values issues mentioned by @rabernat, but one thing at a time. The first and most pressing question is whether or not I’m trying to add a variable in a way that xarray can handle. If I have a dataset:

<xarray.Dataset>
Dimensions:    (dtime: 2, x: 459, y: 363)
Coordinates:
  * x          (x) float64 -2.495e+05 -2.494e+05 ... -2.266e+05 -2.266e+05
  * y          (y) float64 -2.269e+06 -2.269e+06 ... -2.251e+06 -2.251e+06
  * dtime   (dtime)   datetime64[ns] 2012-08-16T10:33:00 2012-08-16T22:33:00
Data variables:
    elevation  (dtime, y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
Attributes:
    transform:      (50.0, 0.0, -248400.0, 0.0, -50.0, -2253200.0)
    crs:            +init=epsg:3413

I’d like to get to:

<xarray.Dataset>
Dimensions:    (dtime: 2, x: 459, y: 363)
Coordinates:
  * x          (x) float64 -2.495e+05 -2.494e+05 ... -2.266e+05 -2.266e+05
  * y          (y) float64 -2.269e+06 -2.269e+06 ... -2.251e+06 -2.251e+06
  * dtime   (dtime)   datetime64[ns] 2012-08-16T10:33:00 2012-08-16T22:33:00
Data variables:
    elevation  (dtime, y, x) float32 nan nan nan nan nan nan ... nan nan nan nan nan
    new_var (dtime) object array-of-arrays-here***
Attributes:
    transform:      (50.0, 0.0, -248400.0, 0.0, -50.0, -2253200.0)
    crs:            +init=epsg:3413

***each dataset.new_var entry would be of the form [array([[a,b],[c,d],[e,f],[a,b]]), array([[z,y],[x,w],[v,u],[t,s],[r,p],[z,y]], array([...]), array([...])].

Back to my mwe, I’ve created a more appropriate dummy function output that looks like the actual output I’d expect from my calculations: misc_arr = [np.array([[1,2],[3,4],[5,6],[1,2]]), np.array([[9,8],[7,6],[5,4],[3,2],[1,0],[9,8]])]

I would like for this ENTIRE array, with all of its subarrays, to be the value for ds.new_var.iloc[t=0]. I’m starting to think this isn’t possible (xarray can’t handle the variable lengths of the values within new_var unless there is also another dim that can be a meaningless index the length of the longest entry within ds.new_var, which is not known a priori).

JessicaS11 · September 24, 2020, 8:08pm

To summarize: how can I iterate over a dimension and add the results as a variable that depends ONLY on that dimension? Is there an alternative function that behaves similar to groupby that will allow me to iterate over dimension t and also let me add a new variable value to that same dimension? Obviously trying to do this using groupby returns a ValueError (dimension ‘t’ already exists as a scalar variable). The two potential solutions I’ve thought of so far are:

create a dummy dimension t_fake and then reset the dimensions on the new variables when I’m done (assuming that’s a straightforward task)
Manually iterate over the time dimension (this seems to be exactly the opposite the point of xarray)

dcherian · September 25, 2020, 5:10pm

Yes this is right. You might look at awkward array.

Can you post an example? This should work with groupby IIUC.

JessicaS11 · September 25, 2020, 8:40pm

Thanks for the feedback, @dcherian! I discovered the squeeze flag for groupby (which defaults to True, making the dimension over which you’ve done the groupby disappear), which helped solve the dimension issue.

I’m excited to look into awkward array more, because on first glance it looks like it’s beginning to address some of the issues I often have working with vector data. Do you have any sense of how it distinguishes itself from Pandas in this regard?

I was able to finally find a solution by converting each individual array of arrays into a Pandas series with dtype=object, so I was then able to add my results one variable length array at a time along my time dimension. Thanks @jbusecke for this suggestion!

@rabernat, I’ve been taking some notes and ultimately hope to put together some info on my journey into xarray and getting through some of these [geospatial] and basic understanding xarray issues. Do you have a place/format you’d recommend?

dcherian · September 25, 2020, 9:06pm

Not sure about awkward vs pandas, pandas is still a “neat table” in some way so maybe it won’t work for more awkward datasets?

awkward is NEP-18 compatible so in theory, Xarray could wrap it (https://github.com/pydata/xarray/issues/4285). In practice, it probably doesn’t work yet

jpivarski · December 30, 2023, 6:48pm

Follow up on this topic at Why does this library exist? · jpivarski/ragged · Discussion #6 · GitHub

Michael_Sumner · January 6, 2024, 6:22am

to me the OP sounds like a group by (time) for the inputs then a call to the warper for all source in a time step, the warper api resolves all input details to a target grid (with resampling option)

I agree rioxarray seems ill suited to this, I prefer to use osgeo.gdal - the resulting grids (materialized or virtual) could then fit into the xarray frame

got an example? even a tiny set of files would be fine, or if it’s of interest I could construct one

Topic		Replies	Views
Xarray Accessor for Georeferenced Data Comparisons	6	452	March 23, 2023
Variable Type changing when using Xarray Data	11	4248	October 9, 2021
XARRAY: ValueError: cannot reindex or align along dimension ‘x’ because the index has duplicate values Data	4	3035	May 8, 2023
Xarray not reading in unused dimensions Data	2	311	October 18, 2023
Nodatavals attribute in geoTIFFs and xarray Data	4	1790	April 17, 2021

Xarray for raster data (DEMs) with inconsistent spatial extent

Related topics