问题
The MWE below shows two ways of integrating the same 2D kernel density estimate, obtained for this data using the stats.gaussian_kde()
function.
The integration is performed for all (x, y)
below the threshold point (x1, y1)
, which defines the upper integration limits (lower integration limits are -infinity
; see MWE).
- The
int1
function uses simple a Monte Carlo approach. - The
int2
function uses the scipy.integrate.nquad function.
The issue is that int1
(ie: the Monte Carlo method) gives systematically larger values for the integral than int2
. I don't know why this happens.
Here's an example of the integral values obtained after 200 runs of int1
(blue histogram) versus the integral result given by int2
(red vertical line):
What is the origin of this difference in the resulting integral value?
MWE
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import integrate
def int1(kernel, x1, y1):
# Compute the point below which to integrate
iso = kernel((x1, y1))
# Sample KDE distribution
sample = kernel.resample(size=50000)
# Filter the sample
insample = kernel(sample) < iso
# The integral is equivalent to the probability of drawing a
# point that gets through the filter
integral = insample.sum() / float(insample.shape[0])
return integral
def int2(kernel, x1, y1):
def f_kde(x, y):
return kernel((x, y))
# 2D integration in: (-inf, x1), (-inf, y1).
integral = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])
return integral
# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)
# Define the threshold point that determines the integration limits.
x1, y1 = 2.5, 1.5
i2 = int2(kernel, x1, y1)
print i2
int1_vals = []
for _ in range(200):
i = int1(kernel, x1, y1)
int1_vals.append(i)
print i
Add
Notice that this question originated from this answer. At first I didn't notice that the answer was mistaken in the integration limits used, which explains why the results between int1
and int2
are different.
int1
is integrating in the domain f(x,y)<f(x1,y1)
(where f is the kernel density estimate), while int2
integrates in the domain (x,y)<(x1,y1)
.
回答1:
You resample the distribution
sample = kernel.resample(size=50000)
and then compute the probability for each sampled point is less than the probability at the bound
insample = kernel(sample) < iso
This is incorrect. Consider the bounds (0,100) and assume your data has u=(0,0) and cov=[[100,0],[0,100]]. Points (0,50) and (50,0) have the same probability in this kernel, but only one of them is in the bounds. Since both pass the test, you are over sampling.
You should be testing whether each point in sample
is inside the bounds, then compute the probability. Something like
def int1(kernel, x1, y1):
# Sample KDE distribution
sample = kernel.resample(size=100)
include = (sample < np.repeat([[x1],[y1]],sample.shape[1],axis=1)).all(axis=0)
integral = include.sum() / float(sample.shape[1])
return integral
I tested this using the following code
def measure(n):
m1 = np.random.normal(size=n)
m2 = np.random.normal(size=n)
return m1,m2
a = scipy.stats.gaussian_kde( np.vstack(measure(1000)) )
print(int1(a,-10,-10))
print(int2(a,-10,-10))
print(int1(a,0,0))
print(int2(a,-0,-0))
Yields
0.0
(4.304674927251112e-232, 4.6980863813551415e-230)
0.26
(0.25897626178338407, 1.4536217446381293e-08)
Monte Carlo integration should work like this
- Sample N random values (uniformly, not from your distribution) over some subset of the possible values of x/y (below I bound it by 10 SDs from mean).
- For each random value compute kernel(rand_x,rand_y)
- Compute the sum and multiply by (volume)/N_samples
In code:
def mc_wo_sample(kernel,x1,y1,lboundx,lboundy):
nsamples = 50000
volume = (x1-lboundx)*(y1-lboundy)
# generate uniform points in range
xrand = np.random.rand(nsamples,1)*(x1-lboundx) + lboundx
yrand = np.random.rand(nsamples,1)*(y1-lboundy) + lboundy
randvals = np.hstack((xrand,yrand)).transpose()
print randvals.shape
return (volume*kernel(randvals).sum())/nsamples
Running the following
print(int1(a,-9,-9))
print(int2(a,-9,-9))
print(mc_wo_sample(a,-9,-9,-10,-10))
print(int1(a,0,0))
print(int2(a,-0,-0))
print(mc_wo_sample(a,0,0,-10,-10))
yields
0.0
(4.012958496109042e-70, 6.7211236076277e-71)
4.08538890986e-70
0.36
(0.37101621760650216, 1.4670898180664756e-08)
0.361614657674
来源:https://stackoverflow.com/questions/35902531/different-integration-results-using-monte-carlo-vs-scipy-integrate-nquad