I have a list of names like:
names = ['A', 'B', 'C', 'D']
and a list of documents; in each document some of these names are mentioned.
We can hugely simplify this using NetworkX. Here names are the nodes we want to consider, and each sublist in document contains the nodes to connect. We can connect the nodes in each sublist by taking all length-2 combinations, and build a MultiGraph to account for repeated co-occurrences:
import networkx as nx
from itertools import combinations

# One edge per co-occurring pair; a MultiGraph keeps parallel edges,
# so the adjacency matrix below contains co-occurrence counts
G = nx.from_edgelist((c for n_nodes in document for c in combinations(n_nodes, r=2)),
                     create_using=nx.MultiGraph)
nx.to_pandas_adjacency(G, nodelist=names, dtype='int')
A B C D
A 0 2 1 1
B 2 0 2 1
C 1 2 0 1
D 1 1 1 0
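For a self-contained run, here is a minimal sketch; the document list below is hypothetical, chosen only to illustrate the shape of the output, so the counts differ from the matrix above:

```python
import networkx as nx
from itertools import combinations

names = ['A', 'B', 'C', 'D']
# Hypothetical documents: each inner list holds the names mentioned together
document = [['A', 'B'], ['B', 'C'], ['A', 'C', 'D']]

# One edge per co-occurring pair; a MultiGraph keeps parallel edges as counts
G = nx.from_edgelist(
    (pair for doc in document for pair in combinations(doc, r=2)),
    create_using=nx.MultiGraph,
)
adj = nx.to_pandas_adjacency(G, nodelist=names, dtype=int)
print(adj)
```

If a name pair appears together in several documents, the MultiGraph stores one parallel edge per appearance, and to_pandas_adjacency sums them into the count.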
I was facing the same issue, so I came up with this code. It takes a context window into account and then builds the co-occurrence matrix. Hope this helps.
import numpy as np

def countOccurences(word, context_window):
    """
    Return the number of times `word` appears in the context window.
    """
    return context_window.count(word)

def co_occurance(feature_dict, corpus, window=5):
    """
    Return the co-occurrence matrix for the given window size. Default is 5.
    `feature_dict` maps each word to its row/column index in the matrix.
    """
    length = len(feature_dict)
    co_matrix = np.zeros([length, length])  # length is the count of all words
    corpus_len = len(corpus)
    # the original iterated an external top_features list;
    # here the word list is derived from feature_dict itself
    words = list(feature_dict)
    for focus_word in words:
        for context_word in words[words.index(focus_word):]:
            if focus_word == context_word:
                co_matrix[feature_dict[focus_word], feature_dict[context_word]] = 0
            else:
                start_index = 0
                count = 0
                while focus_word in corpus[start_index:]:
                    # get the index of the next occurrence of the focus word
                    start_index = corpus.index(focus_word, start_index)
                    # clip the context window to the corpus boundaries
                    fi, li = max(0, start_index - window), min(corpus_len - 1, start_index + window)
                    count += countOccurences(context_word, corpus[fi:li + 1])
                    # updating start index
                    start_index += 1
                # update [Aij] and [Aji] (the matrix is symmetric)
                co_matrix[feature_dict[focus_word], feature_dict[context_word]] = count
                co_matrix[feature_dict[context_word], feature_dict[focus_word]] = count
    return co_matrix
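As a quick sanity check, here is a standalone run on a tiny hypothetical tokenized corpus (the function is restated so the snippet runs on its own):

```python
import numpy as np

def count_occurrences(word, context_window):
    """Count how often `word` appears in the context window."""
    return context_window.count(word)

def co_occurance(feature_dict, corpus, window=5):
    """Co-occurrence matrix for the given window size (restated from above)."""
    length = len(feature_dict)
    co_matrix = np.zeros([length, length])
    corpus_len = len(corpus)
    words = list(feature_dict)
    for i, focus_word in enumerate(words):
        for context_word in words[i:]:
            if focus_word == context_word:
                continue  # the diagonal stays 0
            start_index, count = 0, 0
            while focus_word in corpus[start_index:]:
                # next occurrence of the focus word, window clipped to bounds
                start_index = corpus.index(focus_word, start_index)
                fi, li = max(0, start_index - window), min(corpus_len - 1, start_index + window)
                count += count_occurrences(context_word, corpus[fi:li + 1])
                start_index += 1
            co_matrix[feature_dict[focus_word], feature_dict[context_word]] = count
            co_matrix[feature_dict[context_word], feature_dict[focus_word]] = count
    return co_matrix

# Hypothetical tokenized corpus and word-to-index mapping
corpus = ['apple', 'banana', 'apple', 'cherry', 'banana']
feature_dict = {'apple': 0, 'banana': 1, 'cherry': 2}
print(co_occurance(feature_dict, corpus, window=1))
```

With window=1 only immediate neighbors count, so 'apple' and 'banana' (adjacent twice) score 2, while the other pairs each score 1.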