I am new to machine learning and have run into this problem several times:
I want to do machine learning with sklearn on gridded global climate data with dimensions time, longitude, latitude and one variable with outputs in longitude and latitude. (I want to train a model on each grid cell independently because climatology depends on lon and lat.)
As far as I understand, sklearn always requires X as a two-dimensional array (samples, features) and Y as a one-dimensional array (samples). My samples are the time dimension. My feature is the variable.
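To make the shape requirement concrete, here is a minimal sketch for a single grid cell. It uses plain NumPy arrays with synthetic data; the grid size, the number of time steps, and the three integer categories are made up for illustration and are not my real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical sizes: 60 time steps on a tiny 3 x 4 grid
n_time, n_lat, n_lon = 60, 3, 4
x = rng.normal(size=(n_time, n_lat, n_lon))          # predictor variable
y = rng.integers(0, 3, size=(n_time, n_lat, n_lon))  # category 0, 1 or 2

# Pick one grid cell and reshape to what sklearn expects
X_cell = x[:, 0, 0].reshape(-1, 1)  # (samples, features)
y_cell = y[:, 0, 0]                 # (samples,)

model = LogisticRegression(fit_intercept=False).fit(X_cell, y_cell)
proba_cell = model.predict_proba(X_cell)  # (samples, n_classes)
```

So each cell needs its own 2-D X and 1-D Y, which is why I end up iterating over cells.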
My question now: Is there any way to use sklearn independently on every grid cell (longitude and latitude) without looping over longitude and latitude?
How do you approach this kind of issue?
My current approach: looping over spatial, stacked from longitude and latitude:
import xarray as xr
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn_xarray import Target, wrap
from tqdm import trange

# set category_obs for target
x.coords['category_obs'] = (categories.dims, categories)
# stack latitude and longitude into one spatial dimension
xs = x.stack(spatial=['latitude', 'longitude']).rename({'valid_time': 'sample'})
ys = y.stack(spatial=['latitude', 'longitude']).rename({'valid_time': 'sample'})
elr_dict = {}
proba = []
for si in trange(xs.spatial.size, leave=True, position=0):
    X = xs.isel(spatial=si, drop=True)
    Y = ys.isel(spatial=si, drop=True)
    # sklearn expects (sample, feature)
    X = X.expand_dims('feature', axis=-1)
    # the target is taken from the category_obs coordinate of X
    Y = Target(coord="category_obs", transform_func=LabelEncoder().fit_transform)(X)
    wrapper = wrap(LogisticRegression(fit_intercept=False))
    wrapper.fit(X, Y)
    proba.append(wrapper.predict_proba(X))
    # save wrapper for later use
    elr_dict[si] = wrapper
# put the per-cell results back onto the latitude/longitude grid
proba = xr.concat(proba, 'spatial', compat='override', coords='minimal')
proba = proba.drop_vars('category_obs').assign_coords(spatial=xs.coords['spatial'].sel(spatial=xs.spatial)).unstack('spatial')
# rename back
proba = proba.rename({'feature': 'category', 'sample': 'valid_time'}).assign_coords(category=['below', 'normal', 'above'])
The workflow is embarrassingly parallel. I want to apply it on a 1.5 deg global grid: 10000 land grid cells.
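Since every cell is independent, one option I can imagine is handing the per-cell fits to joblib. This does not remove the loop, it only spreads it over workers. Below is a rough sketch under assumptions: plain NumPy arrays stand in for the stacked xarray objects, the data are synthetic, `fit_cell` is a hypothetical helper, and `n_jobs=2` is an arbitrary choice.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import LogisticRegression


def fit_cell(x_cell, y_cell):
    """Fit one logistic regression for one grid cell (hypothetical helper)."""
    X = x_cell.reshape(-1, 1)  # (samples, features)
    model = LogisticRegression(fit_intercept=False).fit(X, y_cell)
    return model, model.predict_proba(X)


rng = np.random.default_rng(0)
n_time, n_cells = 60, 12  # made-up sizes; the real grid has about 10000 cells
xs = rng.normal(size=(n_time, n_cells))          # (sample, spatial)
ys = rng.integers(0, 3, size=(n_time, n_cells))  # categories 0, 1 or 2

# One independent task per grid cell, distributed over the workers
results = Parallel(n_jobs=2)(
    delayed(fit_cell)(xs[:, si], ys[:, si]) for si in range(n_cells)
)

models = [m for m, _ in results]                   # fitted models, one per cell
proba = np.stack([p for _, p in results], axis=1)  # (sample, spatial, category)
```

The stacked `proba` array could then be wrapped back into an xarray object and unstacked, as in my loop version.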
But deep inside I think there must be a nicer solution than a loop…
(I know I can have multi-dimensional outputs in keras, but I would like to get it done in sklearn with a LogisticRegression.)
See gist here: LR sklearn-xarray with loop over grid · GitHub
Software wrapping sklearn for xarray objects: GitHub - phausamann/sklearn-xarray: Metadata-aware machine learning.
Issue: Multi-dimensional LogisticRegression · Issue #59 · phausamann/sklearn-xarray · GitHub
Thank you