Co-occurrence matrix from nested list of words

无人及你 · 2020-11-30 10:27

I have a list of names like:

names = ['A', 'B', 'C', 'D']

and a list of documents, where in each document some of these names are mentioned.

8 answers
  • 2020-11-30 10:44

    We can hugely simplify this using NetworkX. Here `names` are the nodes we want to consider, and the sublists in `document` contain the nodes to connect.

    We can connect the nodes in each sublist by taking the length-2 combinations, and create a MultiGraph so repeated pairs account for the co-occurrence counts:

    import networkx as nx
    from itertools import combinations
    
    # Parallel edges in the MultiGraph record how often each pair co-occurs
    G = nx.from_edgelist((c for n_nodes in document for c in combinations(n_nodes, r=2)),
                         create_using=nx.MultiGraph)
    nx.to_pandas_adjacency(G, nodelist=names, dtype='int')
    
       A  B  C  D
    A  0  2  1  1
    B  2  0  2  1
    C  1  2  0  1
    D  1  1  1  0
    
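    The snippet above is not self-contained, since the question's `document` list was cut off. A runnable sketch with an *assumed* example input, chosen only so that it reproduces the adjacency shown above:

    ```python
    import networkx as nx
    from itertools import combinations

    names = ['A', 'B', 'C', 'D']
    # Assumed example data, not from the original question
    document = [['A', 'B'], ['B', 'C'], ['A', 'B', 'C', 'D']]

    # Each length-2 combination becomes one edge; the MultiGraph keeps duplicates
    G = nx.from_edgelist((c for n_nodes in document for c in combinations(n_nodes, r=2)),
                         create_using=nx.MultiGraph)
    adj = nx.to_pandas_adjacency(G, nodelist=names, dtype='int')
    print(adj)
    ```

    With this input, ('A', 'B') appears in two sublists and ('B', 'C') in two, so those cells hold 2 while every other pair holds 1.
    
    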
  • 2020-11-30 10:45

    I was facing the same issue, so I came up with this code. It takes a context window into account and then builds the co-occurrence matrix.

    Hope this helps you...

    import numpy as np
    
    def countOccurences(word, context_window):
        """
        This function returns the count of the context word in the window.
        """
        return context_window.count(word)
    
    def co_occurance(feature_dict, corpus, window=5):
        """
        This function returns the co-occurrence matrix for the given window size. Default is 5.
    
        feature_dict maps each word to its row/column index in the matrix.
        """
        length = len(feature_dict)
        co_matrix = np.zeros([length, length])  # length is the count of all words
    
        corpus_len = len(corpus)
        words = list(feature_dict)  # the words to consider
        for focus_word in words:
    
            for context_word in words[words.index(focus_word):]:
                if focus_word == context_word:
                    co_matrix[feature_dict[focus_word], feature_dict[context_word]] = 0
                else:
                    start_index = 0
                    count = 0
                    while focus_word in corpus[start_index:]:
    
                        # get the index of the focus word
                        start_index = corpus.index(focus_word, start_index)
                        fi, li = max(0, start_index - window), min(corpus_len - 1, start_index + window)
    
                        count += countOccurences(context_word, corpus[fi:li + 1])
                        # move past this occurrence
                        start_index += 1
    
                    # update A[i][j]
                    co_matrix[feature_dict[focus_word], feature_dict[context_word]] = count
                    # update A[j][i] (the matrix is symmetric)
                    co_matrix[feature_dict[context_word], feature_dict[focus_word]] = count
        return co_matrix
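    The nested loops above can be cross-checked with a much shorter sliding-window counter. A minimal sketch using `collections.Counter`, with an assumed toy corpus (both the helper name and the data are illustrative, not part of the original answer):

    ```python
    from collections import Counter

    def window_cooccurrence(corpus, vocab, window=2):
        """Count how often each ordered vocab pair falls within `window` words."""
        pairs = Counter()
        for i, focus in enumerate(corpus):
            if focus not in vocab:
                continue
            # scan the surrounding window, skipping the focus word itself
            for ctx in corpus[max(0, i - window): i + window + 1]:
                if ctx in vocab and ctx != focus:
                    pairs[(focus, ctx)] += 1
        return pairs

    corpus = ['A', 'B', 'A', 'C', 'B']  # assumed toy data
    pairs = window_cooccurrence(corpus, {'A', 'B'}, window=1)
    print(pairs)
    ```

    The resulting counts are symmetric by construction, so they can be poured into a square matrix the same way the answer's code fills `co_matrix`.
    
    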
    