Question
The following trivial example raises a singular matrix error. Why? Is there any way to overcome it?
In: from scipy.stats import gaussian_kde
In: points
Out: (array([63, 84]), array([46, 42]))
In: gaussian_kde(points)
LinAlgError: singular matrix
Answer 1:
Looking at the backtrace, you can see it fails when inverting the covariance matrix. This is due to exact multicollinearity of your data. From the page, you have multicollinearity in your data if two variables are collinear, i.e. if
the correlation between two independent variables is equal to 1 or -1
In this case, each of the two variables has only two samples, so they are always collinear (trivially, there is always exactly one line passing through two distinct points). We can check that:
np.corrcoef(np.array([63, 84]), np.array([46, 42]))
[[ 1. -1.]
 [-1.  1.]]
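The same degeneracy can be seen directly in the sample covariance matrix. A minimal sketch, assuming the question's data is stored as a 2x2 NumPy array with one row per variable (the names `cov` and `det` are chosen here for illustration):

```python
import numpy as np

# Data from the question: one row per variable, one column per sample.
points = np.array([[63, 84], [46, 42]], dtype=float)

# Sample covariance of the two variables (a 2x2 matrix).
cov = np.cov(points)

# With only two samples the variables are perfectly correlated,
# so the determinant is zero (up to rounding) and the matrix has no inverse.
det = np.linalg.det(cov)

print(cov)
print("determinant:", det)
```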
To not be necessarily collinear, two variables must have at least n=3 samples. To add to this constraint, you have the limitation pointed out by ali_m, that the number of samples n should be greater than or equal to the number of variables p. Putting the two together, n>=max(3,p); in this case p=2, and n>=3 is the right constraint.
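Note that three samples make non-collinearity possible but do not guarantee it: three points on one line are still perfectly correlated. A small sketch with made-up values (the arrays and the third point (70, 50) below are illustrative assumptions, not data from the question):

```python
import numpy as np

# Three samples that lie on one line, y = 2 * x (illustrative values).
x_line = np.array([1.0, 2.0, 3.0])
y_line = 2.0 * x_line
r_line = np.corrcoef(x_line, y_line)[0, 1]

# The question's two points plus a hypothetical third point (70, 50)
# that is not on the line through the first two.
x_free = np.array([63.0, 84.0, 70.0])
y_free = np.array([46.0, 42.0, 50.0])
r_free = np.corrcoef(x_free, y_free)[0, 1]

print("collinear samples:    ", r_line)  # magnitude 1 up to rounding
print("non-collinear samples:", r_free)  # magnitude strictly below 1
```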
Answer 2:
The error occurs when gaussian_kde() tries to take the inverse of the covariance matrix of your input data. In order for the covariance matrix to be nonsingular, the number of (non-identical) points in your input must be greater than or equal to the number of variables. Try adding a third point and you should see that it works.
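A minimal sketch of that suggestion, assuming a hypothetical third point (70, 50) that does not lie on the line through the original two; the evaluation location is arbitrary as well:

```python
import numpy as np
from scipy.stats import gaussian_kde

# The question's two points plus a hypothetical third one, (70, 50).
# Row 0 holds the first variable, row 1 the second.
points = np.array([[63, 84, 70], [46, 42, 50]], dtype=float)

# With three non-collinear points the covariance matrix is invertible,
# so constructing the estimator no longer raises LinAlgError.
kde = gaussian_kde(points)

# Evaluate the estimated density at one arbitrary location (x=70, y=45).
density = kde(np.array([[70.0], [45.0]]))
print(density)
```

If the third point were placed on the same line as the first two, the covariance matrix would again be singular.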
This answer on Cross Validated has a proper explanation of why this is the case.
Source: https://stackoverflow.com/questions/19261858/kde-fails-with-two-points