Standard packages/procedures for multiple linear regression analysis

Hello, Pangeo community!
I am a graduate student in Physical Oceanography hoping to contribute to the development of xarray and Pangeo itself one day :slight_smile:
I am only beginning to learn these tools, though, so sorry if this sounds like a very basic question.
Somehow it is not easy to find open-source notebooks with multiple regression in Earth Sciences.
I’d be curious how people do it and if there is a standard way to do it.
From talking to people at my institution, I gathered that about half of them write their own model from scratch (using the matrix formulation) and half rely on existing solutions. Too many people use Matlab.
I want to perform a multiple linear regression analysis using principal components I extracted from a bunch of tide gauges along the East Coast and satellite altimetry across a wider domain in the Atlantic.
Would it be a standard thing to just use statsmodels.regression.linear_model.OLS?
Does it work well with an xarray Dataset? How do I make it work for the whole field (I need to estimate regression coefficients in each grid cell of the altimetry data regressed onto the PCs)?

Thank you!
Yuta


Welcome Yuta!

In addition to statsmodels, a very common package used in the Pangeo community for multiple linear regression would be scikit-learn.
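
To make that concrete, here is a minimal sketch of a multiple linear regression with scikit-learn. The data are synthetic stand-ins (the names `pcs`, `true_coefs`, and `y` are made up for illustration, not taken from your dataset), so treat it as a starting point rather than a recipe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins (assumption): 3 principal-component time series
# as predictors and one target series built from them plus small noise.
rng = np.random.default_rng(0)
n_time, n_pcs = 200, 3
pcs = rng.standard_normal((n_time, n_pcs))
true_coefs = np.array([2.0, -1.0, 0.5])
y = pcs @ true_coefs + 0.3 + 0.01 * rng.standard_normal(n_time)

# Fit y = intercept + sum_k coef_k * PC_k by ordinary least squares.
model = LinearRegression().fit(pcs, y)

coefs = model.coef_            # one coefficient per predictor
intercept = model.intercept_   # fitted constant term
r2 = model.score(pcs, y)       # coefficient of determination on the training data
```

statsmodels' OLS fits the same kind of model and also reports standard errors and p-values, which scikit-learn does not, so the choice mostly depends on whether you need that inference output.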

Unfortunately, as with most machine-learning libraries, there is no direct support for xarray inputs / outputs in scikit-learn. But since you are just working with principal components (1D), you should be able to convert your xarray data to a pandas DataFrame easily.
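
For the whole-field case, one possible approach (a sketch under assumptions, not the only way) is to flatten the grid so that every grid cell becomes one regression target, since `LinearRegression` accepts a 2D target array. Everything below is synthetic: the dimension names `time`, `mode`, `lat`, `lon` and the variables `pcs`, `true_beta`, and `sla` are hypothetical placeholders for your PCs and altimetry field.

```python
import numpy as np
import xarray as xr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_time, n_mode, n_lat, n_lon = 120, 2, 4, 5

# Hypothetical predictors: principal components with dims (time, mode).
pcs = xr.DataArray(
    rng.standard_normal((n_time, n_mode)),
    dims=("time", "mode"),
    coords={"time": np.arange(n_time), "mode": [1, 2]},
    name="pc",
)

# Hypothetical "true" regression maps, used only to build a synthetic field.
true_beta = xr.DataArray(
    rng.standard_normal((n_mode, n_lat, n_lon)),
    dims=("mode", "lat", "lon"),
    coords={
        "mode": [1, 2],
        "lat": np.linspace(30.0, 45.0, n_lat),
        "lon": np.linspace(-80.0, -60.0, n_lon),
    },
)

# Synthetic gridded field with dims (time, lat, lon), plus small noise.
sla = (pcs * true_beta).sum("mode")
sla = sla + 0.01 * rng.standard_normal((n_time, n_lat, n_lon))

# xarray -> pandas: a 2D DataArray converts to a DataFrame (time x mode).
X = pcs.to_pandas()

# Flatten the grid so each column is one grid cell: shape (time, lat*lon).
Y = sla.transpose("time", "lat", "lon").values.reshape(n_time, -1)

# One fit handles all grid cells; coef_ has shape (n_cells, n_modes).
model = LinearRegression().fit(X.to_numpy(), Y)

# Put the coefficients back on the grid as an xarray object.
coef = xr.DataArray(
    model.coef_.T.reshape(n_mode, n_lat, n_lon),
    dims=("mode", "lat", "lon"),
    coords={"mode": pcs["mode"], "lat": sla["lat"], "lon": sla["lon"]},
    name="regression_coefficient",
)
```

Real altimetry data usually has NaNs over land, which `LinearRegression` will not accept, so those cells would need to be masked out before fitting.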

This looks like a nice tutorial:

Good luck!


Thank you so much, Ryan, for the welcome and the useful links!
