Vectorized sklearn

I am new to machine learning and have stumbled across this problem multiple times:

I want to do machine learning with sklearn on gridded global climate data with dimensions time, longitude and latitude, one input variable, and outputs over longitude and latitude. (I want to train a model on each grid cell independently, because the climatology depends on longitude and latitude.)

As far as I understand, sklearn always requires X as a two-dimensional array (samples, features) and y as 1-d.
My samples are along the time dimension. My single feature is the variable.

My question now: is there any way to use sklearn independently on every grid cell (longitude and latitude) without looping over longitude and latitude?

How do you approach this kind of issue?

My current way: looping over a spatial dimension stacked from longitude and latitude:

# set category_obs for target
x.coords['category_obs'] = (categories.dims, categories)

elr_dict = {}
proba = []
for si in trange(xs.spatial.size, leave=True, position=0):

    X = xs.isel(spatial=si, drop=True)
    Y = ys.isel(spatial=si, drop=True)

    X = X.expand_dims('feature',axis=-1)

    Y = Target(coord="category_obs", transform_func=LabelEncoder().fit_transform)(X)

    # (the original snippet had unbalanced parentheses here; presumably
    # the estimator is wrapped and then fitted)
    wrapper = wrap(LogisticRegression(fit_intercept=False)).fit(X, Y)

    # save wrapper for later use
    elr_dict[si] = wrapper
    proba.append(wrapper.predict_proba(X))

proba = xr.concat(proba,'spatial', compat='override', coords='minimal')
proba = proba.drop('category_obs').assign_coords(spatial=xs.coords['spatial'].sel(spatial=xs.spatial)).unstack('spatial')
# rename back
proba = proba.rename({'feature':'category','sample':'valid_time'}).assign_coords(category=['below','normal','above'])

The workflow is embarrassingly parallel. I want to apply it on a 1.5 deg global grid: 10000 land grid cells.
But deep down I think there must be a nicer solution than a loop…

(I know I can have multi-dim outputs in keras, but I would like to get it done in sklearn with a LogisticRegression)

See gist here: LR sklearn-xarray with loop over grid · GitHub

Software wrapping sklearn for xarray objects: GitHub - phausamann/sklearn-xarray: Metadata-aware machine learning.
Issue: Multi-dimensional LogisticRegression · Issue #59 · phausamann/sklearn-xarray · GitHub

Thank you

Sounds like what I’ve done at work. I’m afraid the code is in a private company bitbucket repo, but I don’t see why I can’t outline the approach. I’d like one day to opensource the package, but I’m a bit of a noob with geospatial and dask so I’m sure there will be better qualified developers. Anyway, caveat aside…

I convert the dataset to an array using .to_array. One trick I’ve learned is to reindex the predictors according to the feature order expected by the sklearn object (crucial). I also rechunk so I’m not chunking the predictor features. Chunking spatially is good, of course. xarray’s apply_ufunc() is useful and your chunks should fit into worker memory as numpy arrays. There’s a bit of reshaping as well. I’m still in the middle of refactoring and unit-testing the package to be honest, so it’s been a while since I broke the methods apart to inspect the data structure at each point.
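A minimal runnable sketch of that conversion, with toy data and assumed variable names (`ndvi`, `elevation` are placeholders, not from the original post):

```python
import numpy as np
import xarray as xr

# toy Dataset standing in for a stack of raster bands (variable names assumed)
ds = xr.Dataset(
    {
        "ndvi": (("y", "x"), np.random.default_rng(0).random((4, 5))),
        "elevation": (("y", "x"), np.random.default_rng(1).random((4, 5))),
    }
)

# convert data variables into one "variable" dimension, then select so the
# feature order matches what the fitted sklearn estimator expects (crucial)
feature_order = ["elevation", "ndvi"]
da = ds.to_array(dim="variable").sel(variable=feature_order)

# when the array is dask-backed, keep the feature dimension in a single chunk
# and chunk only spatially, e.g.:
# da = da.chunk({"variable": -1, "y": 512, "x": 512})
```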

Then what? So for an sklearn object, I’ve found xarray’s apply_ufunc useful (although damned hard work to get right, with faffing with input and output core dims and reshaping). This also leverages plain sklearn estimators. I should also look at leveraging dask ML. Sorry, a bit fragmented as I’m supposed to be listening to a company briefing. :wink: But maybe gives you a pointer? Your workload may be a bit different to mine anyway.
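For the original per-grid-cell question, one way to wire plain sklearn into apply_ufunc is `vectorize=True`, which loops over the non-core (spatial) dims internally. A toy sketch (all names and data are made up for illustration):

```python
import numpy as np
import xarray as xr
from sklearn.linear_model import LogisticRegression

def fit_predict_cell(x, y):
    # x: (time,) predictor for one grid cell; y: (time,) integer class labels
    model = LogisticRegression(fit_intercept=False)
    model.fit(x.reshape(-1, 1), y)
    return model.predict_proba(x.reshape(-1, 1))  # (time, n_classes)

rng = np.random.default_rng(0)
xs = xr.DataArray(rng.normal(size=(10, 3, 4)), dims=("time", "lat", "lon"))
# labels constructed so every grid cell sees both classes
labels = np.broadcast_to((np.arange(10) % 2)[:, None, None], (10, 3, 4))
ys = xr.DataArray(labels.copy(), dims=("time", "lat", "lon"))

proba = xr.apply_ufunc(
    fit_predict_cell, xs, ys,
    input_core_dims=[["time"], ["time"]],
    output_core_dims=[["time", "category"]],
    vectorize=True,  # loops over lat/lon internally, one fit per grid cell
)
```

Note this still loops under the hood (via `numpy.vectorize`), so it tidies the bookkeeping rather than speeding up the per-cell fits.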

Broadly, the key bit of my package is a class with methods you can chain, so I can make one call to load a bunch of geoTiffs and can chain methods to intersect with ground truth in a shapefile and output that data table to parquet. When I have a trained model, I can use the same data load function and then call a predict method that handles the reshaping and calling of apply_ufunc and gives you back a spatial grid of projections. If anyone thought this would be useful, I’m open to taking this to management about opensourcing.


Thanks a lot.

So you define a function f with sklearn calls and then use xr.apply_ufunc(f, da), like

def f_sklearn(training_data_with_training_target, validation_data_with_validation_target):
    X = some_reshaping_and_renaming(training_data_with_training_target)
    Y = Target(coord="category_obs", transform_func=LabelEncoder().fit_transform)(X)
    wrapper = wrap(LogisticRegression(fit_intercept=False)).fit(X, Y)
    return wrapper.predict_proba(X)
    # or, for out-of-sample predictions, use the validation data instead:
    # X2 = some_reshaping_and_renaming(validation_data_with_validation_target)
    # return wrapper.predict_proba(X2)

xr.apply_ufunc(f_sklearn, input, **magic_here_also)


Or did you, in the end, write a new class where you have a wrapper for each grid cell?

In the long run, having this publicly available in a package like sklearn-xarray (Multi-dimensional LogisticRegression · Issue #59 · phausamann/sklearn-xarray · GitHub) or your own would be very valuable.


Hmm, for the data wrangling prior to model development, I simply used make_geocube to add a raster layer of the class labels (from a geoDataFrame). This is likely a weak point, as I’d prefer this to create a delayed object. Then it’s a case of converting your xarray Dataset (I think) into a DataArray along the predictors dimension. My memory will be better when I revisit this bit of code to add some unit testing. :wink: I think you’ll have a multidimensional (>2-D) array at this point. I then filter out any nans and wrangle what’s left into a dask dataframe with columns as features and rows as pixels, which is then saved as parquet. So you’ve got parquet files from the dask dataframe with features as columns and pixels as rows. This effectively treats pixels independently. Yes, yes, I know.
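The nan-filtering and pixels-as-rows wrangling might look roughly like this (a toy in-memory sketch with pandas; the original workflow uses a dask dataframe and parquet, and the variable names are assumed):

```python
import numpy as np
import xarray as xr

# toy (variable, y, x) DataArray with a nan pixel (names assumed)
arr = np.random.default_rng(0).random((2, 3, 3))
arr[0, 0, 0] = np.nan
da = xr.DataArray(
    arr,
    dims=("variable", "y", "x"),
    coords={"variable": ["ndvi", "elevation"]},
)

# flatten the spatial dims, put pixels on rows and features on columns,
# and drop any pixel with a missing predictor
df = (
    da.stack(pixel=("y", "x"))
      .transpose("pixel", "variable")
      .to_pandas()
      .dropna()
)
```

With dask-backed data the same idea goes through `to_dask_dataframe()` and `to_parquet()` instead.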

Given our typical sparse labelled data, this has gathered our widely scattered, labelled pixels into something that easily fits in memory. So now it’s conventional data science, which gets you a long way! Save your pickled trained model, noting the ordering of your predictor columns, of course.
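One simple way to keep the model and its expected column order together is to pickle them as a bundle (a sketch with toy data; the feature names are assumptions):

```python
import io
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_order = ["elevation", "ndvi"]  # assumed names
X = np.random.default_rng(0).normal(size=(20, 2))
y = np.arange(20) % 2
model = LogisticRegression().fit(X, y)

# bundle the estimator with its expected column order so inference code
# can reindex new data to match (BytesIO here; a file path in practice)
buf = io.BytesIO()
pickle.dump({"model": model, "feature_order": feature_order}, buf)

buf.seek(0)
bundle = pickle.load(buf)
```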

For inference, you can load your pickled model. You can also load your new data. Repeat the same step as above to create that (I think) multidimensional DataArray that has the predictors in the same order as your model expects. Then I have a function that takes an actual multidimensional ndarray, flattens it to 2D so we have an n x p array where n is the number of pixels and p the number of features. So we’ve now lost the spatial grid and have a 2D array exactly of the form to throw into our model’s .predict() method like any other bog-standard numpy array of samples of predictors. This helper function then reshapes the output ndarray from .predict() into the original shape (so reconstructing the spatial grid for our predictions). This works both for .predict() and .predict_proba() - you just have to do careful housekeeping of your ndarray shape and make use of reshape(blah, -1) to deal with N classes.
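The flatten-predict-reshape helper described here can be sketched in a few lines of numpy (a hedged reconstruction, not the original code; the function name is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_grid(model, arr):
    """arr: (n_features, ny, nx) -> (n_classes, ny, nx) probability grid."""
    n_features, ny, nx = arr.shape
    flat = arr.reshape(n_features, -1).T      # (n_pixels, n_features)
    out = model.predict_proba(flat)           # (n_pixels, n_classes)
    return out.T.reshape(-1, ny, nx)          # (n_classes, ny, nx)

# toy fitted model: 2 features, 2 classes
rng = np.random.default_rng(0)
model = LogisticRegression().fit(rng.normal(size=(50, 2)), np.arange(50) % 2)

grid = predict_grid(model, rng.normal(size=(2, 4, 5)))
```

The `reshape(-1, ny, nx)` at the end is the "reshape(blah, -1)" housekeeping: it works for any number of classes because the class count is inferred.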

So now we can call apply_ufunc, passing it our helper function and our multidimensional (dask delayed) DataArray (which, from memory, will be 3D, with a dimension for predictors over each of our 2D spatial dimensions). Don’t forget to tell apply_ufunc your input core dims are the predictors, and call the output core dims, say, “class” and tell it the number of classes (which you know from your model). I don’t mind saying this bit drove me nuts. I really didn’t know what I was doing at the time. But the net result at this step is that your input is a dask-delayed DataArray with good old x and y coordinates, and the output is an actual ndarray of predictions with the same good old x and y coordinates.
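The apply_ufunc wiring described above might look like this (toy data, assumed names; the dask-specific keywords are shown commented for the dask-backed case):

```python
import numpy as np
import xarray as xr
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
model = LogisticRegression().fit(rng.normal(size=(50, 3)), np.arange(50) % 2)

# (predictor, y, x) DataArray in the model's expected feature order
da = xr.DataArray(rng.normal(size=(3, 4, 5)), dims=("predictor", "y", "x"))

def predict_block(block):
    # apply_ufunc moves core dims last, so block arrives as (y, x, predictor)
    ny, nx, n_feat = block.shape
    out = model.predict_proba(block.reshape(-1, n_feat))
    return out.reshape(ny, nx, -1)            # (y, x, class)

proba = xr.apply_ufunc(
    predict_block,
    da,
    input_core_dims=[["predictor"]],    # consumed feature dimension
    output_core_dims=[["class"]],       # new dimension of class probabilities
    # for a dask-backed array, add:
    # dask="parallelized",
    # dask_gufunc_kwargs={"output_sizes": {"class": 2}},
)
```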

What you do then is, I guess, up to you. I convert it to a Dataset, specifying “class” as the dim, and rename the output class prediction variables. This is handy for then writing to netCDF. You can create a delayed netCDF writer, so you can write all these steps without actually triggering compute. Then you can hit the button, specifying your dask cluster; your dask workers fire up and the predictions get written to disk chunk by chunk. Or you get a WinError 955. :wink: But since wrapping these steps into chained methods, I find the dask task list is pretty small, the scheduler doesn’t seem to get the same migraines I used to give it, and the workflow is the most stable it’s ever been. You then have a netCDF you can do what you want with, e.g. extract Tiffs etc. NB: I’ve never managed to actually get the final TIFF to properly be aware of the CRS, but I do this an S2 tile at a time, so know it’s the same CRS as the input tile.
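The Dataset conversion and delayed writer might look like this (toy data; the class names are assumptions, and the netCDF step is commented since it needs dask-backed data to be usefully delayed):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
p = rng.random((4, 5, 2))
p = p / p.sum(axis=-1, keepdims=True)     # fake class probabilities

proba = xr.DataArray(
    p,
    dims=("y", "x", "class"),
    coords={"class": ["not_crop", "crop"]},  # class names assumed
)

# one data variable per class label, handy for writing to netCDF
ds_out = proba.to_dataset(dim="class")

# with dask-backed data you can build a delayed writer and trigger it later:
# delayed = ds_out.to_netcdf("predictions.nc", compute=False)
# delayed.compute()
```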

Sorry that’s a bit waffly, but I think there’s a bit of sensitivity to sharing code at the moment, unfortunately.

Pretty detailed waffle though Guy! :slight_smile:


There are things along these general lines here too, I think: Scalable Supervised Machine Learning on the Open Data Cube — Digital Earth Africa 2021 documentation


Ah, I’ll look forward to reading that resource, thanks @RichardScottOZ .
It may not have been around last year when I needed something. It also relies on the datacube, and I was having to load images from files. I’m aware of the ability to load pixels from geometries, but I’m definitely going to check out their approach. I note, briefly, that their notebook refers to the number of CPUs. I assume/hope it plays nice with dask as well.

Yeah, those are new I think.

I have been tinkering with what I can/cannot extract from that codebase and running it on some filesystem-based things here and there.

They are pretty well dask capable, yes, in whatever they are doing.

See recent Oceania post.