LinAlgError: Matrix is Singular #116

Open
scardonau94 opened this issue May 19, 2022 · 15 comments

Comments

@scardonau94

Hello everyone
I am trying to fit a GWR model. I am following the example code, and each example uses the same pipeline. When I run "gwr_selector", I get the error LinAlgError: Matrix is Singular. I have 2941 polygons and 20 variables to fit the model. The only way I can get the code to work is to fit it with 150 polygons and 5 variables. Do you know what mistake I am making?
Bests

image

@ljwolf
Member

ljwolf commented May 19, 2022

Yes. I think that bw_min is probably too small. If there are not enough observations, the least squares procedure will not be able to invert the local XtX matrix. Here, you've defined this to use, at a minimum, two observations in each local model. If this is the case, then XtX will be singular when the two observations have the same value for any of the variables.

So, try increasing bw_min.

Does this not work by fitting to the full dataset of 2941 polygons and a larger bw_min?
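
A minimal NumPy sketch of why a very small bw_min is fragile. This is not mgwr code: it assumes an unweighted local design matrix with an intercept and 20 covariates (the variable count from the original post), and the random data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars = 20   # number of covariates, as in the original post
bw_min = 2    # observations available to the smallest local model

# Local design matrix: intercept + 20 covariates, but only 2 rows.
X_local = np.column_stack([np.ones(bw_min), rng.normal(size=(bw_min, n_vars))])
XtX = X_local.T @ X_local

# rank(XtX) can never exceed the number of rows in X_local,
# so this 21 x 21 matrix cannot be inverted.
rank = np.linalg.matrix_rank(XtX)

# With more neighbours than columns, full rank becomes possible.
X_bigger = np.column_stack([np.ones(50), rng.normal(size=(50, n_vars))])
rank_bigger = np.linalg.matrix_rank(X_bigger.T @ X_bigger)

print(XtX.shape, rank, rank_bigger)
```

In this sketch the smallest local model has far fewer observations than coefficients, so the local XtX is rank deficient regardless of the data values.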

@scardonau94
Author

Thank you for your quick answer. I tried different values of bw_min, but none of them solved the problem.

@ljwolf
Member

ljwolf commented May 19, 2022 via email

@scardonau94
Author

I have checked for collinearity, but I did not find any perfectly collinear variables.

@TaylorOshan
Collaborator

TaylorOshan commented May 19, 2022 via email

@larsiusprime

I have had this problem before and found a workaround, not sure if it's a "valid" approach or not:

If the variable that is causing you trouble is a floating point value, you might be able to get away with adding a little bit of random "dust" to it. For instance, I had a particular variable that for 80% of my observations was in the range of 1,000-10,000. But for about 20% of the observations, this variable is flat zero. The flat zeros were causing the issue if they happened to be the only ones in a particular bandwidth range, or so I surmised.

So my solution was to add "dust" to all the variable values: a random amount between 0.00 and 0.99. My final values will all be rounded to the nearest whole number anyway.

Adding the "dust" makes it so that all the troublesome parcels with the zero value are now technically different from one another. And hopefully the amount on them is so small that it won't meaningfully affect the predictions.
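
A rough sketch of the "dust" idea in plain NumPy. The neighbourhood below is made up for illustration (30 observations, one covariate that is zero everywhere), and the dust range of [0, 1) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical local neighbourhood: the second covariate is zero everywhere,
# for example because only the "flat zero" observations fall in this window.
x1 = rng.uniform(1_000, 10_000, size=30)
x2 = np.zeros(30)
X = np.column_stack([np.ones(30), x1, x2])
rank_before = np.linalg.matrix_rank(X)        # the zero column adds no rank

# Add uniform "dust" in [0, 1) to the troublesome column.
x2_dusted = x2 + rng.uniform(0.0, 1.0, size=30)
X_dusted = np.column_stack([np.ones(30), x1, x2_dusted])
rank_after = np.linalg.matrix_rank(X_dusted)  # columns are now linearly independent

print(rank_before, rank_after)
```

The matrix becomes invertible, although in a window like this one the local coefficient for the dusted column is driven by the noise alone.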

@ljwolf
Member

ljwolf commented Jul 28, 2023

Hi @larsiusprime! That's a reasonable way to avoid the singularity issue if you can afford that small bit of random noise in your analysis budget. For most, adding a random value in the range [0, 1e-4] is probably sufficient.

For any potential developer interested in solving this in our code, the solution would be to swap our current numpy.linalg.inv() to a pseudo-inverse, like pinv_extended() in statsmodels. See, for example, a regression that fits on a perfectly collinear input:

>>> import statsmodels
>>> import numpy
>>> from statsmodels import api as sm
>>> x = numpy.random.random(size=100)
>>> X = numpy.column_stack((numpy.ones_like(x), x, x,)) # perfectly collinear columns 2 & 3
>>> y = X @ numpy.array([[3, -2, 4]]).T
>>> sm.OLS(endog=y, exog=X, hasconst=True).fit().summary()
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.130e+31
Date:                Fri, 28 Jul 2023   Prob (F-statistic):               0.00
Time:                        09:51:05   Log-Likelihood:                 3265.6
No. Observations:                 100   AIC:                            -6527.
Df Residuals:                      98   BIC:                            -6522.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0000   3.66e-16   8.21e+15      0.000       3.000       3.000
x1             1.0000   2.97e-16   3.36e+15      0.000       1.000       1.000
x2             1.0000   2.97e-16   3.36e+15      0.000       1.000       1.000
==============================================================================
Omnibus:                        6.528   Durbin-Watson:                   0.115
Prob(Omnibus):                  0.038   Jarque-Bera (JB):                8.807
Skew:                           0.267   Prob(JB):                       0.0122
Kurtosis:                       4.352   Cond. No.                     1.36e+17
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.07e-33. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""

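The same contrast can be sketched with NumPy alone. This is only an illustration of the inverse versus pseudo-inverse idea on the normal equations, not a patch for mgwr; the data and the rcond cutoff are assumptions chosen for the example.

```python
import numpy as np

x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x, x])  # columns 2 and 3 are identical
y = X @ np.array([3.0, -2.0, 4.0])            # equivalent to y = 3 + 2x

XtX, Xty = X.T @ X, X.T @ y

# A plain inverse is expected to fail here because XtX is singular.
try:
    np.linalg.inv(XtX)
    inv_failed = False
except np.linalg.LinAlgError:
    inv_failed = True

# The pseudo-inverse returns the minimum-norm least-squares solution instead,
# splitting the combined slope of 2 evenly across the two identical columns.
beta = np.linalg.pinv(XtX, rcond=1e-10) @ Xty
fitted = X @ beta

print(inv_failed, beta)
```

The coefficients match the pattern in the statsmodels summary above: the intercept is recovered, and the two collinear columns share the total effect.
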
@jagreen1

jagreen1 commented Dec 20, 2023

Any idea if there will be a fix to this? The program essentially doesn't work and always results in this error.

I had this error 3 years ago, came back to the same project and the error is still present.

@martinfleis
Member

@jagreen1 there is ongoing work to fix this in #134

@jagreen1

jagreen1 commented Dec 20, 2023

Good to hear. I have read through that thread, but don't have sufficient expertise myself to contribute. I'm essentially lost as to how to progress with a project unless this can be fixed. Cheers.

@ljwolf
Member

ljwolf commented Dec 20, 2023

@jagreen1 Can you fit a regular OLS on your data?

from spreg import OLS
OLS(y, x)

If you cannot fit an OLS, then the problem is not with MGWR (#132).

If you can, one way that frequently works is to increase the minimum bandwidth.

This LinAlgError can arise because the "local" model near a given site has all the same values for some feature. Forcing the bandwidth larger prevents this while also preventing overfitting.

You can see that the original poster of this issue is setting bw_min=2, which will pretty much always fail if there's a categorical/one-hot encoded feature.
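
One way to screen for this before running the bandwidth search is to check the rank of each local design matrix. The helper below, degenerate_sites, is hypothetical and not part of mgwr. It assumes an adaptive window of the bw nearest neighbours and ignores kernel weights, so treat it as a rough diagnostic only.

```python
import numpy as np

def degenerate_sites(coords, X, bw):
    """Return indices of sites whose bw nearest neighbours give a rank-deficient local X.

    Hypothetical helper for illustration; not part of mgwr.
    """
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    bad = []
    for i, c in enumerate(coords):
        d = np.linalg.norm(coords - c, axis=1)   # distances from site i
        nearest = np.argsort(d)[:bw]             # adaptive (k-nearest) window
        if np.linalg.matrix_rank(X1[nearest]) < X1.shape[1]:
            bad.append(i)
    return bad

# Toy data: 40 sites on a line, with a one-hot feature that is 1 only for the last 10.
coords = np.column_stack([np.arange(40.0), np.zeros(40)])
rng = np.random.default_rng(1)
dummy = (np.arange(40) >= 30).astype(float)
X = np.column_stack([rng.normal(size=40), dummy])

small = degenerate_sites(coords, X, bw=5)    # many windows see a constant dummy
large = degenerate_sites(coords, X, bw=40)   # every window sees both levels

print(len(small), len(large))
```

With the small window, sites far from the boundary see only one level of the one-hot feature, which makes it collinear with the intercept. Widening the window removes the problem in this toy example.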

@jagreen1

Yes, there are no problems with the OLS. I'm using a logistic/binomial (0 or 1) dataset of 100k points, and a further subset of just 20k points. It works with some bandwidths and, seemingly at random, fails with others.

@ljwolf
Member

ljwolf commented Dec 20, 2023

Interesting, OK. And, to confirm, the issue arises in Sel_BW()?

Do you have any categorical/one-hot features, or are they all continuous?

This is something I've long been interested in conceptually... I hope to have the proof of concept linked above completed by early Jan.

@jagreen1

jagreen1 commented Dec 21, 2023

@ljwolf Yes, I can confirm that this issue occurs during sel_bw, in my case for a binomial regression model.

The independent variables are continuous (not categorical), however where data wasn't available I had to assign values of zero. Not sure if that causes an issue.

I have primarily been using the MGWR GUI application, which often has the error LinAlgError: Matrix is Singular.
This occurs for seemingly random bandwidths. For example, for one dataset I tried running the GWR analysis for bandwidths 1770 to 1780 at intervals of 1. The regression ran for a bandwidth of 1780, but not for 1770 through 1779.

[images: MGWR_Error_1, MGWR_Error_3]

I decided to try analyzing the data purely in Python (not using the GUI), and I now receive a slightly different error when calling sel_bw: IndexError: invalid index to scalar variable. This error doesn't occur when using the default 'Gaussian' model, but given that my dependent variable is binary, that model isn't appropriate.

[image: MGWR_sel_bw_error]

@ljwolf
Member

ljwolf commented Dec 21, 2023

> had to assign a value of zero

Yes, it would. See @larsiusprime's comment. It's a perfectly useful fix here.

What "Matrix is Singular" means is that the weighted least squares matrix (Xt W X) is not invertible. This is often because some variable in X is perfectly collinear with another variable. If you fill all your missing data with zeros and this missing data occurs more commonly in some localities, then it's entirely possible that you're getting all zeros in some local model for some covariate... like, all x values for sites within the bandwidth are zero. When this happens at one site, that x becomes perfectly collinear with the intercept, that local model becomes degenerate, and the error is thrown.
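
The argument can be written out as a short sketch. The notation is assumed here rather than taken from the mgwr source: W_i is the diagonal kernel weight matrix at site i, x_{kj} is the value of covariate j at observation k, and e_0, e_j are unit vectors selecting the intercept and covariate j.

```latex
% Local weighted least squares estimate at site i
\hat{\beta}_i = (X^\top W_i X)^{-1} X^\top W_i y,
\qquad W_i = \operatorname{diag}(w_{i1}, \dots, w_{in}), \quad w_{ik} \ge 0 .

% Suppose covariate j is constant (e.g. zero-filled) wherever the weight is nonzero:
x_{kj} = c \quad \text{for all } k \text{ with } w_{ik} > 0 .

% Take v = c\, e_0 - e_j \ne 0. Then, for every k with w_{ik} > 0,
(Xv)_k = c \cdot 1 - x_{kj} = 0 ,

% and therefore
v^\top X^\top W_i X \, v = \sum_{k} w_{ik} \, (Xv)_k^2 = 0 .
```

Because X^T W_i X is positive semi-definite, a nonzero v with a zero quadratic form means (X^T W_i X) v = 0, so the matrix has no inverse. The zero-filled case is c = 0.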


6 participants