Question
I have some sample data:
import numpy as np
import pandas as pd

a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
I want to merge strings that have substrings (comma-separated items) in common. In this example, 'b,c', 'a' and 'a,c,d,e' should be merged because they can be linked to one another: 'a' shares a with 'a,c,d,e', which in turn shares c with 'b,c'. 'j,k,l' and 'k,l,m' should form another group. In the end, I would like something like this:
ACTIVITY     group
'b,c'        0
'a'          0
'a,c,d,e'    0
'f,g,h,i'    1
'j,k,l'      2
'k,l,m'      2
That gives three groups, with no substrings shared between any two groups.
So far I have tried building a similarity data frame in which 1 means that two strings have at least one substring in common. Here is my code:
commonWords=1
for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0
for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0
a
The result is:
  ACTIVITY  b,c  a  a,c,d,e  f,g,h,i  j,k,l  k,l,m
0      b,c    1  0        1        0      0      0
1        a    0  1        1        0      0      0
2  a,c,d,e    1  1        1        0      0      0
3  f,g,h,i    0  0        0        1      0      0
4    j,k,l    0  0        0        0      1      1
5    k,l,m    0  0        0        0      1      1
Wherever there is a 1, the corresponding row and column should end up in the same group. But I am stuck at this point. Could anyone help me out?
Answer 1:
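Use networkx with connected_components: treat each comma-separated item as a node, connect the items that appear together in the same string, and give every string the ID of the connected component its items belong to: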
import pandas as pd
import networkx as nx
from itertools import combinations, chain

a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
#split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
#create edges (can only connect two nodes)
L2_nested = [list(combinations(l,2)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
print (L2)
[('b', 'c'), ('a', 'c'), ('a', 'd'), ('a', 'e'), ('c', 'd'),
('c', 'e'), ('d', 'e'), ('f', 'g'), ('f', 'h'), ('f', 'i'),
('g', 'h'), ('g', 'i'), ('h', 'i'), ('j', 'k'), ('j', 'l'),
('k', 'l'), ('k', 'l'), ('k', 'm'), ('l', 'm')]
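#create the graph from the lists
G=nx.Graph()
#add every item as a node first, so a single-item row that shares nothing still gets a group
G.add_nodes_from(chain.from_iterable(splitted))
G.add_edges_from(L2)
connected_comp = nx.connected_components(G)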
#create dict for common values
node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}
# create groups by mapping first value of series called splitted
a['group'] = [node2id.get(x[0]) for x in splitted]
print (a)
  ACTIVITY  group
0      b,c      0
1        a      0
2  a,c,d,e      0
3  f,g,h,i      1
4    j,k,l      2
5    k,l,m      2
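If adding networkx as a dependency is not an option, the same grouping can be sketched with a small union-find (disjoint-set) structure using only the standard library. This is an alternative to the answer above, not the answerer's method. The function name group_by_shared_items and its sep parameter are made up for this illustration.

```python
def group_by_shared_items(strings, sep=","):
    """Label each string so that strings sharing any item get the same group ID.

    Hypothetical helper for illustration: a union-find over the individual items.
    """
    parent = {}  # maps each item to its parent in the union-find forest

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    split = [s.split(sep) for s in strings]
    for items in split:
        for item in items:
            # Linking every item to the row's first item joins the whole row.
            union(items[0], item)

    # Number the groups in order of first appearance.
    root_to_id = {}
    return [root_to_id.setdefault(find(items[0]), len(root_to_id)) for items in split]


activities = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
groups = group_by_shared_items(activities)
print(groups)  # [0, 0, 0, 1, 2, 2]
```

With the data frame from the question, the result could be attached as a['group'] = group_by_shared_items(a['ACTIVITY']). Rows with a single item that appears nowhere else simply get a group of their own.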
Source: https://stackoverflow.com/questions/62632266/how-to-merge-strings-that-have-substrings-in-common-to-produce-some-groups-in-a