Standard packages/procedures for multiple linear regression analysis

Hello, Pangeo community!
I am a graduate student in Physical Oceanography hoping to contribute to the development of xarray and Pangeo itself one day :slight_smile:
I am only beginning to learn these tools though. Sorry if it is going to sound like a very basic question.
Somehow it is not easy to find open-source notebooks that use multiple regression in Earth Sciences.
I'd be curious how people do it and whether there is a standard way to do it.
From talking to people at my institution, I gathered that about half write their own model from scratch (using the matrix formulation) and half rely on existing solutions. Too many people use Matlab.
I want to perform a multiple linear regression analysis using principal components I extracted from a bunch of tide gauges along the East Coast and satellite altimetry across a wider domain in the Atlantic.
Would it be a standard thing to just use statsmodels.regression.linear_model.OLS?
Does it work well with an xarray Dataset? How do I make it work for the whole field, given that I need a set of regression coefficients in each grid cell of the altimetry data regressed onto the PCs?

Thank you!
Yuta


Welcome Yuta!

In addition to statsmodels, a very common package used in the Pangeo community for multiple linear regression would be scikit-learn.

Unfortunately, as with most machine-learning libraries, there is no direct support for xarray inputs / outputs in scikit-learn. But since you are just working with principal components (1D time series), you should be able to convert your xarray data to a pandas DataFrame easily.
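Here is a minimal sketch of what that could look like, not a definitive recipe. The data are synthetic stand-ins, and the names `pcs`, `field`, `sla`, and the grid sizes are assumptions made up for illustration. It relies on `LinearRegression` accepting a 2D target array, so every grid cell is fit in a single call rather than in a loop.

```python
import numpy as np
import pandas as pd
import xarray as xr
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins (assumptions): 3 principal components and a small
# sea-level field on a (time, lat, lon) grid built from known coefficients.
rng = np.random.default_rng(0)
time = pd.date_range("2000-01-01", periods=120, freq="MS")
pcs = xr.DataArray(
    rng.standard_normal((time.size, 3)),
    dims=("time", "mode"),
    coords={"time": time, "mode": ["pc1", "pc2", "pc3"]},
    name="pc",
)
true_coefs = rng.standard_normal((3, 4, 5))  # (mode, lat, lon)
field = xr.DataArray(
    np.einsum("tm,mij->tij", pcs.values, true_coefs)
    + 0.01 * rng.standard_normal((time.size, 4, 5)),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": np.arange(4.0), "lon": np.arange(5.0)},
    name="sla",
)

# Predictors: a DataFrame indexed by time with one column per PC.
X = pcs.to_pandas()

# Targets: flatten (lat, lon) so that each grid cell becomes one column.
Y = field.transpose("time", "lat", "lon").stack(cell=("lat", "lon"))

# One fit for all grid cells: Y has shape (n_time, n_cells).
model = LinearRegression().fit(X, Y.values)

# coef_ has shape (n_cells, n_modes); reshape it back onto the grid.
coefs = xr.DataArray(
    model.coef_.reshape(field.sizes["lat"], field.sizes["lon"], pcs.sizes["mode"]),
    dims=("lat", "lon", "mode"),
    coords={"lat": field["lat"], "lon": field["lon"], "mode": pcs["mode"]},
    name="regression_coefficient",
)
```

`coefs.sel(mode="pc1")` then gives a map of the coefficient for the first PC. One caveat: `LinearRegression` does not accept NaNs, so grid cells with missing values (land, for example) would have to be dropped or masked before the fit.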

This looks like a nice tutorial:

Good luck!


Thank you so much, Ryan! For the welcome and the useful links!
