[Python] With 10,000+ one-dimensional vectors, what is the fastest way to compute correlation coefficients?

Background

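At work I recently needed to compute, in Python, the correlation coefficients between a large number of one-dimensional vectors, where:

Test data: shape (1000, 100), i.e. 1000 vectors of shape (100,)

Template data: shape (1000, 100), i.e. 1000 vectors of shape (100,)

Pairing every test vector with every template vector means 1,000,000 correlation computations, so computational efficiency matters a great deal.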

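Common ways to compute the correlation coefficient:

Based on Pandas

DataFrame.corr()

In practice this was very slow, because a DataFrame has to be built for every pair; not recommended.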

Based on NumPy

1. np.cov()

This computes the covariance matrix, so you still need to write extra code to turn it into a correlation coefficient.
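As a minimal sketch of that extra step (the random inputs and variable names are only illustrative), the off-diagonal entry of the covariance matrix is divided by the square root of the product of the two variances:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)
y = rng.random(100)

# For two 1-D inputs, np.cov returns the 2x2 matrix
# [[var(x), cov(x, y)], [cov(x, y), var(y)]]
c = np.cov(x, y)

# Pearson r = cov(x, y) / (std(x) * std(y))
r = c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])
print(r)
```

The value should agree with np.corrcoef(x, y)[0, 1] up to floating-point error.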

2. np.corrcoef()

This is NumPy's implementation of the Pearson coefficient, which measures linear correlation.
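A small sketch of how np.corrcoef() behaves, using illustrative array sizes rather than the real data: with two 1-D inputs it returns a 2x2 matrix, and with two 2-D inputs (rows treated as variables) it returns one matrix over all rows of both inputs, whose top-right block holds the cross-correlations. Whether this single call is faster than the loops timed in this post was not measured here.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((5, 100))  # 5 test vectors (illustrative size)
b = rng.random((4, 100))  # 4 template vectors (illustrative size)

# Two 1-D inputs: 2x2 correlation matrix, r is the off-diagonal entry
r_single = np.corrcoef(a[2], b[3])[0, 1]

# Two 2-D inputs: (5 + 4) x (5 + 4) matrix over all rows of a and b
full = np.corrcoef(a, b)

# The top-right block pairs each row of a with each row of b
cross = full[:a.shape[0], a.shape[0]:]
print(cross.shape)
```

Note that the full matrix also contains the a-with-a and b-with-b blocks, so for large inputs it computes and stores more than the cross block that is actually needed.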

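Based on SciPy: the three classic correlation coefficients (Pearson, Spearman, Kendall)

3. stats.pearsonr()

The Pearson coefficient is the covariance cov(X, Y) divided by the product of the two standard deviations, σX and σY.

It measures the linear relationship between variables and is best suited to normally distributed data; it performs poorly on nonlinear relationships. Its advantages are that it is simple to compute, the result is the correlation coefficient r itself, and it is easy to interpret.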
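To make the definition concrete, here is a small sketch that compares pearsonr() with the formula written out by hand. The nearly linear test data is made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.random(100)
y = 2.0 * x + 0.1 * rng.random(100)  # nearly linear in x (illustrative data)

# pearsonr returns the coefficient r together with a p-value
r, p_value = pearsonr(x, y)

# The same coefficient from the definition: cov(X, Y) / (sigma_X * sigma_Y).
# The 1/(n-1) factors cancel, so centered dot products are enough.
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
print(r, r_manual)
```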

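4. stats.spearmanr()

The Spearman correlation coefficient can be understood as the correlation between ranks: it only looks at the positions of the values after each vector is sorted and ignores the actual values. Judging significance traditionally means comparing the result against a table of critical values for the Spearman rank correlation coefficient, which is cumbersome, and the coefficient is harder to read as a degree of correlation.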
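The "correlation between ranks" idea can be checked directly. In this sketch (made-up data, no tied values), Spearman's coefficient equals the Pearson coefficient of the rank-transformed vectors, and a monotonic but nonlinear relationship gives a perfect Spearman score while Pearson on the raw values falls short of 1:

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(0)
x = rng.random(100)
y = np.exp(5.0 * x)  # monotonic in x, but strongly nonlinear

# Spearman's rho, returned together with a p-value
rho, p_value = spearmanr(x, y)

# Pearson computed on the ranks instead of the raw values
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

# Pearson on the raw values, for comparison
r_raw, _ = pearsonr(x, y)
print(rho, r_on_ranks, r_raw)
```

The ranking step is extra work per pair, which is consistent with the rank-based methods being slower in the timings in this post.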

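5. stats.kendalltau()

The Kendall correlation coefficient is also a rank correlation coefficient, but it is aimed at ordinal variables: it suits the case where both variables are ordered categories. It only looks at ordering, not at the actual values, and ranges from -1 to 1.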
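A tiny worked sketch of "only ordering matters": for data without ties, Kendall's tau is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The five-element vectors below are made up for illustration.

```python
from itertools import combinations

from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

concordant = 0
discordant = 0
for i, j in combinations(range(len(x)), 2):
    # A pair is concordant when x and y order the two items the same way
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

# scipy's result (tau-b by default, which matches this formula when there are no ties)
tau, p_value = kendalltau(x, y)
print(tau_manual, tau)
```

Counting pairs this naive way takes quadratic time in the vector length; scipy's implementation is more efficient, but it still has more work to do per call than a Pearson computation.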

Hands-on test

import numpy as np
import time
from scipy.stats import pearsonr, spearmanr, kendalltau

# Generate random data
a = np.random.random((1000, 100))
b = np.random.random((1000, 100))
corrmat = np.zeros((a.shape[0], b.shape[0]))

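# Method 1: np.cov() plus manual normalization
# Note: the print call inside each loop is included in the measured time.
tic1 = time.time()
idx = 0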
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        print('calculate:%d' % idx)
        c = np.cov(a[i], b[j])
        corrmat[i][j] = c[0, 1] / np.sqrt(np.prod(np.diag(c)))
        idx += 1
toc1 = time.time()
# Method 2: np.corrcoef()
idx = 0
tic2 = time.time()
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        print('calculate:%d' % idx)
        corrmat[i][j] = np.corrcoef(a[i], b[j])[0, 1]
        idx += 1
toc2 = time.time()

# Method 3: scipy.stats.pearsonr()
idx = 0
tic3 = time.time()
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        print('calculate:%d' % idx)
        corrmat[i][j] = pearsonr(a[i], b[j])[0]
        idx += 1
toc3 = time.time()

# Method 4: scipy.stats.spearmanr()
idx = 0
tic4 = time.time()
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        print('calculate:%d' % idx)
        corrmat[i][j] = spearmanr(a[i], b[j])[0]
        idx += 1
toc4 = time.time()

# Method 5: scipy.stats.kendalltau()
idx = 0
tic5 = time.time()
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        print('calculate:%d' % idx)
        corrmat[i][j] = kendalltau(a[i], b[j])[0]
        idx += 1
toc5 = time.time()

print('----------------------')
print('Method 1 time: %.2f' % (toc1 - tic1))
print('Method 2 time: %.2f' % (toc2 - tic2))
print('Method 3 time: %.2f' % (toc3 - tic3))
print('Method 4 time: %.2f' % (toc4 - tic4))
print('Method 5 time: %.2f' % (toc5 - tic5))

Output (times in seconds):

Method 1 time: 128.00
Method 2 time: 75.88
Method 3 time: 60.78
Method 4 time: 564.74
Method 5 time: 297.49

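Conclusion:

Although both compute the Pearson coefficient, SciPy's pearsonr() was faster than NumPy's corrcoef() in this test; the hand-written version based on np.cov() (method 1) was slower still, and the two rank-based coefficients were the slowest of the five.

If your data are continuous, normally distributed, and linearly related, stats.pearsonr() from SciPy is the best choice among the methods tested.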
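All five methods above call a library function once per pair inside a Python loop. As a hedged sketch of an alternative that was not benchmarked in this post, the whole Pearson matrix can be obtained with one matrix product: center each row, scale it to unit length, and multiply. The helper name pairwise_pearson is hypothetical, and the sketch assumes no row is constant (a constant row has zero norm and would cause a division by zero).

```python
import numpy as np


def pairwise_pearson(a, b):
    """Pearson r between every row of a and every row of b (hypothetical helper)."""
    # Center each row so its mean is zero
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    # Scale each centered row to unit Euclidean norm (assumes no constant rows)
    a_n = a_c / np.linalg.norm(a_c, axis=1, keepdims=True)
    b_n = b_c / np.linalg.norm(b_c, axis=1, keepdims=True)
    # The dot product of two centered unit vectors is their Pearson r
    return a_n @ b_n.T


rng = np.random.default_rng(0)
a = rng.random((1000, 100))
b = rng.random((1000, 100))
corrmat = pairwise_pearson(a, b)
print(corrmat.shape)
```

Unlike pearsonr(), this returns only the coefficients and no p-values, so it fits the case where just the correlation matrix is needed.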

References:

https://blog.csdn.net/yanjiangdi/article/details/100939969

--------------------------------------Update below--------------------------------

For the sake of completeness I also timed the pandas-based corr() method, using method='pearson' (the default; it can be changed to 'spearman' or 'kendall').

import time
import numpy as np
import pandas as pd

a = np.random.random((1000, 100))
b = np.random.random((1000, 100))
corrmat = np.zeros((a.shape[0], b.shape[0]))

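# Method 6: pandas DataFrame.corr()
tic6 = time.time()
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        # Build a two-column DataFrame for this pair; corr() returns a 2x2
        # matrix whose off-diagonal entry is the coefficient
        data = pd.DataFrame({'A': a[i], 'B': b[j]})
        corrmat[i][j] = data.corr().values[0, 1]
toc6 = time.time()
print('Method 6 time: %.2f' % (toc6 - tic6))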

Output (time in seconds):

Method 6 time: 810.66

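Additional conclusion:

As noted earlier, building a DataFrame for every pair is really slow. pandas is not primarily designed for this kind of raw numerical computation, so the slowdown is understandable.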
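Most of that cost comes from constructing one DataFrame per pair. A sketch of the alternative usage, with illustrative array sizes and not timed here, builds a single DataFrame whose columns are the vectors and calls corr() once; the method argument can also be switched to 'spearman' or 'kendall':

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.random((5, 100))  # 5 test vectors (illustrative size)
b = rng.random((4, 100))  # 4 template vectors (illustrative size)

# DataFrame.corr() correlates columns, so stack the vectors and transpose:
# columns 0..4 hold the rows of a, columns 5..8 hold the rows of b
df = pd.DataFrame(np.vstack([a, b]).T)

full = df.corr(method='pearson')  # 9 x 9 matrix over all columns

# Keep only the block that pairs a-columns with b-columns
cross = full.iloc[:a.shape[0], a.shape[0]:].to_numpy()
print(cross.shape)
```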

Source: CSDN

Author: AnnieBee

Link: https://blog.csdn.net/weixin_40006612/article/details/103239860
