Pandas: Apply function over each pair of columns under constraints

Submitted by 别说谁变了你拦得住时间么 on 2019-12-11 01:55:56

Question


As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:

Code |  14  |  17  |  19  | ...
w1   |  0   |   5  |   3  | ...
w2   |  2   |   5  |   4  | ... 
w3   |  0   |   0  |   5  | ...

The Code corresponds to a specific location in a rectangular grid, and the w's are different words. I would like to apply the cosine similarity measure to each pair of columns only (EDITED!) if the sum of the items in one of the columns of the pair is greater than 5.

The desired output would be something like:

     | [14,17]  |  [14,19]  |  [14,...]  |  [17,19]  | ...
Sim  |cs(14,17) |cs(14,19)  |cs(14,...)  |cs(17,19)..| ...

cs is the result of the cosine similarity for each pair of columns. Is there any suitable method to do this?

Any help would be appreciated :-)


Answer 1:


To apply the cosine metric to each pair from two collections of inputs, you could use scipy.spatial.distance.cdist. This will be much faster than a double Python loop. Note that metric='cosine' computes the cosine distance, which is 1 minus the cosine similarity.

Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:

import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
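
Why pairing the filtered columns with all columns covers the condition can be checked with a small sketch. It assumes the condition means "at least one column of the pair sums to more than 5", and it adds a hypothetical column '21' (not in the original data) so that one pair fails the condition:

```python
from itertools import combinations
import pandas as pd

# Same data as above, plus a hypothetical extra column '21' with a small sum.
df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0],
                   '19': [3, 4, 5], '21': [1, 0, 1]})

mask = df.sum(axis=0) > 5          # column sums: 14 -> 2, 17 -> 10, 19 -> 12, 21 -> 2
kept = list(df.columns[mask])      # columns that pass the filter

# Unordered pairs where at least one column has a sum greater than 5.
qualifying = [(a, b) for a, b in combinations(df.columns, 2)
              if mask[a] or mask[b]]

# Unordered pairs reachable by matching a kept column against every column.
reachable = {tuple(sorted((a, b))) for a in kept for b in df.columns if a != b}

print(qualifying)
print(set(qualifying) == reachable)
```

The pair ('14', '21') is the only one left out, because neither of its columns passes the filter.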

Then the cosine distances for all of these pairs can be computed with one call to cdist:

import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[  2.92893219e-01,   1.11022302e-16,   3.00000000e-01],
#        [  4.34314575e-01,   3.00000000e-01,   1.11022302e-16]])
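
To see how these numbers relate to the cosine similarity asked about in the question, here is a minimal check on a single pair of columns. It computes the similarity directly from its definition with NumPy (an addition for illustration, not part of the original answer) and compares it with what cdist returns:

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})

a = df['17'].to_numpy(dtype=float)
b = df['14'].to_numpy(dtype=float)

# Cosine similarity from its definition: dot product over the product of norms.
similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cdist expects 2-D inputs with one observation per row.
distance = SSD.cdist(a[None, :], b[None, :], metric='cosine')[0, 0]

print(similarity, distance)
print(np.isclose(distance, 1 - similarity))
```

So the entries in the array above are distances, and 1 - values gives the similarities.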

The values can be wrapped in a new DataFrame and reshaped:

result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()

Putting it all together, and dropping the pairs that match a column with itself:

import pandas as pd
import scipy.spatial.distance as SSD
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)

yields the Series of cosine distances, indexed by (filtered column, other column):

17  14    0.292893
    19    0.300000
19  14    0.434315
    17    0.300000
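
The Series above holds distances and lists the pair (17, 19) twice, once in each order. A possible way to get closer to the layout requested in the question is sketched below. It is not part of the original answer; the name wide and the '[a,b]' label format are illustrative choices, and the pair labels are sorted as strings:

```python
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]

# cdist returns cosine distances, so subtract from 1 to get similarities.
sim = 1 - SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(sim, index=df2.columns, columns=df.columns).stack()

# Drop self-pairs, then keep one entry per unordered pair of columns.
not_self = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result[not_self]
result.index = pd.MultiIndex.from_tuples([tuple(sorted(p)) for p in result.index])
result = result[~result.index.duplicated()].sort_index()

# One row named 'Sim', one column per pair, labelled like [14,17].
wide = result.to_frame('Sim').T
wide.columns = ['[{},{}]'.format(a, b) for a, b in wide.columns]
print(wide)
```

Because cosine similarity is symmetric, dropping the second occurrence of each unordered pair loses no information.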


Source: https://stackoverflow.com/questions/38455278/pandas-apply-function-over-each-pair-of-columns-under-constraints
