how to merge strings that have certain number of substrings in common to produce some groups in a data frame in Python

不想你离开。 提交于 2020-07-07 11:12:27

问题


I asked a question like this. But that is a simple one. Which has been resolved. how to merge strings that have substrings in common to produce some groups in a data frame in Python.

But here, I have an advanced version of the similar question:

I have a sample data:

a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})

What I want to do is merge some strings if they have sub strings in common. So, in this example, the strings 'b,c','a','a,c,d,e' should be merged together because they can be linked to each other. 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope I can have something like:

               group
'b,c',         0
'a',           0
'a,c,d,e',     0
'f,g,h,i',     1
'j,k,l',       2
'k,l,m'        2

So, I can have three groups and there is no common sub strings between any two groups.

Now, I am trying to build up a similarity data frame, in which 1 means two strings have sub strings in common. Here is my code:

commonWords=1

for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0

for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0
    
a

The result is:

    ACTIVITY    b,c     a   a,c,d,e     f,g,h,i     j,k,l   k,l,m
0   b,c          1      0       1           0       0       0
1   a            0      1       1           0       0       0
2   a,c,d,e      1      1       1           0       0       0
3   f,g,h,i      0      0       0           1       0       0
4   j,k,l        0      0       0           0       1       1
5   k,l,m        0      0       0           0       1       1

In this code, commonWords means how many sub strings I hope that two strings have in common. For example, if commonWords=2, then two strings will be merged together only if there are two, or more than two sub strings in them. When commonWords=2, the group should be:

               group
'b,c',         0
'a',           1
'a,c,d,e',     2
'f,g,h,i',     3
'j,k,l',       4
'k,l,m'        4

Could anyone please help me out here?


回答1:


Use:

a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})


from itertools import combinations, chain
from collections import Counter

#split values by , to lists
splitted = a['ACTIVITY'].str.split(',')

commonWords=2
#create edges (can only connect two nodes)
L2_nested = [list(combinations(l,commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))

#convert values to sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]

#create new columns for matched sets
for val in f1:
    j = ','.join(val)
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print (a)

#forward filling values of new columns and use factorize for groups
new = pd.factorize(a[['ACTIVITY']].assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]

a = a[['ACTIVITY']].assign(group = new)
print (a)
  ACTIVITY  group
0      b,c      0
1        a      1
2  a,c,d,e      2
3  f,g,h,i      3
4    j,k,l      4
5    k,l,m      4


来源:https://stackoverflow.com/questions/62632874/how-to-merge-strings-that-have-certain-number-of-substrings-in-common-to-produce

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!