In R there is a function (cm.rnorm.cor, from package CreditMetrics) that takes the number of samples, the number of variables, and a correlation matrix in order to create correlated data.
If you Cholesky-decompose a covariance matrix C into L L^T, and generate an independent random vector x whose components have unit variance, then Lx will be a random vector with covariance C, since Cov(Lx) = L Cov(x) L^T = L L^T = C.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

num_samples = 1000
num_variables = 2
cov = [[0.3, 0.2], [0.2, 0.2]]

# Lower-triangular Cholesky factor L, so that L @ L.T == cov.
L = np.linalg.cholesky(cov)
# L.shape is (2, 2)

# Independent standard-normal samples, one row per variable.
uncorrelated = np.random.standard_normal((num_variables, num_samples))

# Multiplying by L induces the desired covariance; then shift by the mean.
mean = [1, 1]
correlated = np.dot(L, uncorrelated) + np.array(mean).reshape(2, 1)
# correlated.shape is (2, 1000)

plt.scatter(correlated[0, :], correlated[1, :], c='green')
plt.show()
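As a quick sanity check (an addition, not from the original answer), the sample covariance of the generated data should be close to cov. np.cov treats each row as a variable by default, which matches the (2, 1000) layout above:

# Continuing the snippet above: estimate the covariance from the samples.
print(np.cov(correlated))
# Should be close to [[0.3, 0.2], [0.2, 0.2]]; shifting by the mean does not change the covariance.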
Reference: See Cholesky decomposition
If you want to generate two series, X and Y, with a particular (Pearson) correlation coefficient (e.g. 0.2):

rho = cov(X,Y) / sqrt(var(X)*var(Y))

you could choose the covariance matrix to be

cov = [[1.0, 0.2],
       [0.2, 1.0]]

This makes cov(X,Y) = 0.2, and the variances var(X) and var(Y) both equal to 1, so rho equals 0.2.
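If you want variances other than 1, you can rescale a correlation matrix into a covariance matrix via cov = D R D, where D is the diagonal matrix of standard deviations. A minimal sketch (the names R and sigma are illustrative, not from the original answer):

import numpy as np

R = np.array([[1.0, 0.2],
              [0.2, 1.0]])    # desired correlation matrix
sigma = np.array([2.0, 0.5])  # desired standard deviations (hypothetical values)

# cov[i, j] = sigma[i] * sigma[j] * R[i, j]
cov = np.diag(sigma) @ R @ np.diag(sigma)
print(cov)
# [[4.    0.2 ]
#  [0.2   0.25]]

Here cov(X,Y) = 2.0 * 0.5 * 0.2 = 0.2, while rho is still 0.2.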
For example, below we generate pairs of correlated series, X and Y, 1000 times, and then plot a histogram of the resulting correlation coefficients:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

np.random.seed(1)

num_samples = 1000
num_variables = 2
cov = [[1.0, 0.2], [0.2, 1.0]]

# Lower-triangular Cholesky factor of the covariance matrix.
L = np.linalg.cholesky(cov)

rhos = []
for i in range(1000):
    # Fresh independent standard-normal samples on each trial.
    uncorrelated = np.random.standard_normal((num_variables, num_samples))
    # Induce the desired covariance.
    correlated = np.dot(L, uncorrelated)
    X, Y = correlated
    # Sample Pearson correlation coefficient for this trial.
    rho, pval = stats.pearsonr(X, Y)
    rhos.append(rho)

plt.hist(rhos)
plt.show()
As you can see, the correlation coefficients are generally near 0.2, but for any given sample, the correlation will most likely not be 0.2 exactly.
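To put a number on that spread, a one-line addition (assuming rhos from the snippet above is still in scope):

# Mean and standard deviation of the 1000 sample correlation coefficients.
print(np.mean(rhos), np.std(rhos))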
numpy.random.multivariate_normal is the function that you want.
Example:
import numpy as np
import matplotlib.pyplot as plt
num_samples = 400
# The desired mean values of the sample.
mu = np.array([5.0, 0.0, 10.0])
# The desired covariance matrix.
r = np.array([
[ 3.40, -2.75, -2.00],
[ -2.75, 5.50, 1.50],
[ -2.00, 1.50, 1.25]
])
# Generate the random samples.
y = np.random.multivariate_normal(mu, r, size=num_samples)
# Plot various projections of the samples.
plt.subplot(2,2,1)
plt.plot(y[:,0], y[:,1], 'b.')
plt.plot(mu[0], mu[1], 'ro')
plt.ylabel('y[1]')
plt.axis('equal')
plt.grid(True)
plt.subplot(2,2,3)
plt.plot(y[:,0], y[:,2], 'b.')
plt.plot(mu[0], mu[2], 'ro')
plt.xlabel('y[0]')
plt.ylabel('y[2]')
plt.axis('equal')
plt.grid(True)
plt.subplot(2,2,4)
plt.plot(y[:,1], y[:,2], 'b.')
plt.plot(mu[1], mu[2], 'ro')
plt.xlabel('y[1]')
plt.axis('equal')
plt.grid(True)
plt.show()
Result: three scatter-plot panels showing the pairwise projections of the samples (y[0] vs. y[1], y[0] vs. y[2], and y[1] vs. y[2]), with the corresponding mean marked by a red dot in each panel.
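As a quick check (an addition, not part of the original answer), the sample mean and covariance of y should be close to mu and r:

# Continuing the snippet above: rows of y are observations, so use rowvar=False.
print(y.mean(axis=0))           # should be close to mu
print(np.cov(y, rowvar=False))  # should be close to r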
See also CorrelatedRandomSamples in the SciPy Cookbook.