Generate correlated data in Python (3.3)

前端 未结 2 1290
臣服心动
臣服心动 2020-12-28 09:51

In R there is a function (cm.rnorm.cor, from package CreditMetrics), that takes the amount of samples, the amount of variables, and a correlation m

相关标签:
2条回答
  • 2020-12-28 10:27

    If you Cholesky-decompose a covariance matrix C into L L^T, and generate an independent random vector x, then Lx will be a random vector with covariance C.

    import numpy as np
    import matplotlib.pyplot as plt
    linalg = np.linalg
    np.random.seed(1)
    
    num_samples = 1000
    num_variables = 2
    cov = [[0.3, 0.2], [0.2, 0.2]]
    
    L = linalg.cholesky(cov)
    # print(L.shape)
    # (2, 2)
    uncorrelated = np.random.standard_normal((num_variables, num_samples))
    mean = [1, 1]
    correlated = np.dot(L, uncorrelated) + np.array(mean).reshape(2, 1)
    # print(correlated.shape)
    # (2, 1000)
    plt.scatter(correlated[0, :], correlated[1, :], c='green')
    plt.show()
    

    enter image description here

    Reference: See Cholesky decomposition


    If you want to generate two series, X and Y, with a particular (Pearson) correlation coefficient (e.g. 0.2):

    rho = cov(X,Y) / sqrt(var(X)*var(Y))
    

    you could choose the covariance matrix to be

    cov = [[1, 0.2],
           [0.2, 1]]
    

    This makes the cov(X,Y) = 0.2, and the variances, var(X) and var(Y) both equal to 1. So rho would equal 0.2.

    For example, below we generate pairs of correlated series, X and Y, 1000 times. Then we plot a histogram of the correlation coefficients:

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    linalg = np.linalg
    np.random.seed(1)
    
    num_samples = 1000
    num_variables = 2
    cov = [[1.0, 0.2], [0.2, 1.0]]
    
    L = linalg.cholesky(cov)
    
    rhos = []
    for i in range(1000):
        uncorrelated = np.random.standard_normal((num_variables, num_samples))
        correlated = np.dot(L, uncorrelated)
        X, Y = correlated
        rho, pval = stats.pearsonr(X, Y)
        rhos.append(rho)
    
    plt.hist(rhos)
    plt.show()
    

    enter image description here

    As you can see, the correlation coefficients are generally near 0.2, but for any given sample, the correlation will most likely not be 0.2 exactly.

    0 讨论(0)
  • 2020-12-28 10:50

    numpy.random.multivariate_normal is the function that you want.

    Example:

    import numpy as np
    import matplotlib.pyplot as plt
    
    
    num_samples = 400
    
    # The desired mean values of the sample.
    mu = np.array([5.0, 0.0, 10.0])
    
    # The desired covariance matrix.
    r = np.array([
            [  3.40, -2.75, -2.00],
            [ -2.75,  5.50,  1.50],
            [ -2.00,  1.50,  1.25]
        ])
    
    # Generate the random samples.
    y = np.random.multivariate_normal(mu, r, size=num_samples)
    
    
    # Plot various projections of the samples.
    plt.subplot(2,2,1)
    plt.plot(y[:,0], y[:,1], 'b.')
    plt.plot(mu[0], mu[1], 'ro')
    plt.ylabel('y[1]')
    plt.axis('equal')
    plt.grid(True)
    
    plt.subplot(2,2,3)
    plt.plot(y[:,0], y[:,2], 'b.')
    plt.plot(mu[0], mu[2], 'ro')
    plt.xlabel('y[0]')
    plt.ylabel('y[2]')
    plt.axis('equal')
    plt.grid(True)
    
    plt.subplot(2,2,4)
    plt.plot(y[:,1], y[:,2], 'b.')
    plt.plot(mu[1], mu[2], 'ro')
    plt.xlabel('y[1]')
    plt.axis('equal')
    plt.grid(True)
    
    plt.show()
    

    Result:

    enter image description here

    See also CorrelatedRandomSamples in the SciPy Cookbook.

    0 讨论(0)
提交回复
热议问题