Co-occurrence matrix from nested list of words

无人及你 · 2020-11-30 10:27

I have a list of names like:

names = ['A', 'B', 'C', 'D']

and a list of documents, where in each document some of these names are mentioned.

8 answers
  • 2020-11-30 10:44

    We can hugely simplify this using NetworkX. Here `names` are the nodes we want to consider, and the sublists in `document` contain the nodes to connect.

    We can connect the nodes in each sublist by taking the length-2 combinations, and create a MultiGraph so repeated pairs account for the co-occurrence counts:

    import networkx as nx
    from itertools import combinations
    
    # Parallel edges in the MultiGraph record how often each pair co-occurs
    G = nx.from_edgelist((c for n_nodes in document for c in combinations(n_nodes, r=2)),
                         create_using=nx.MultiGraph)
    nx.to_pandas_adjacency(G, nodelist=names, dtype='int')
    
       A  B  C  D
    A  0  2  1  1
    B  2  0  2  1
    C  1  2  0  1
    D  1  1  1  0
    
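    The snippet above is not self-contained, since the question's `document` list was cut off. A runnable sketch with an *assumed* example input, chosen only so that it reproduces the adjacency shown above:

    ```python
    import networkx as nx
    from itertools import combinations

    names = ['A', 'B', 'C', 'D']
    # Assumed example data, not from the original question
    document = [['A', 'B'], ['B', 'C'], ['A', 'B', 'C', 'D']]

    # Each length-2 combination becomes one edge; the MultiGraph keeps duplicates
    G = nx.from_edgelist((c for n_nodes in document for c in combinations(n_nodes, r=2)),
                         create_using=nx.MultiGraph)
    adj = nx.to_pandas_adjacency(G, nodelist=names, dtype='int')
    print(adj)
    ```

    With this input, ('A', 'B') appears in two sublists and ('B', 'C') in two, so those cells hold 2 while every other pair holds 1.
    
    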
  • 2020-11-30 10:45

    I was facing the same issue, so I came up with this code. It takes a context window into account and then builds the co-occurrence matrix.

    Hope this helps you...

    import numpy as np
    
    def countOccurences(word, context_window):
        """
        This function returns the count of the context word in the window.
        """
        return context_window.count(word)
    
    def co_occurance(feature_dict, corpus, window=5):
        """
        This function returns the co-occurrence matrix for the given window size. Default is 5.
    
        feature_dict maps each word to its row/column index in the matrix.
        """
        length = len(feature_dict)
        co_matrix = np.zeros([length, length])  # length is the count of all words
    
        corpus_len = len(corpus)
        words = list(feature_dict)  # the words to consider
        for focus_word in words:
    
            for context_word in words[words.index(focus_word):]:
                if focus_word == context_word:
                    co_matrix[feature_dict[focus_word], feature_dict[context_word]] = 0
                else:
                    start_index = 0
                    count = 0
                    while focus_word in corpus[start_index:]:
    
                        # get the index of the focus word
                        start_index = corpus.index(focus_word, start_index)
                        fi, li = max(0, start_index - window), min(corpus_len - 1, start_index + window)
    
                        count += countOccurences(context_word, corpus[fi:li + 1])
                        # move past this occurrence
                        start_index += 1
    
                    # update A[i][j]
                    co_matrix[feature_dict[focus_word], feature_dict[context_word]] = count
                    # update A[j][i] (the matrix is symmetric)
                    co_matrix[feature_dict[context_word], feature_dict[focus_word]] = count
        return co_matrix
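    The nested loops above can be cross-checked with a much shorter sliding-window counter. A minimal sketch using `collections.Counter`, with an assumed toy corpus (both the helper name and the data are illustrative, not part of the original answer):

    ```python
    from collections import Counter

    def window_cooccurrence(corpus, vocab, window=2):
        """Count how often each ordered vocab pair falls within `window` words."""
        pairs = Counter()
        for i, focus in enumerate(corpus):
            if focus not in vocab:
                continue
            # scan the surrounding window, skipping the focus word itself
            for ctx in corpus[max(0, i - window): i + window + 1]:
                if ctx in vocab and ctx != focus:
                    pairs[(focus, ctx)] += 1
        return pairs

    corpus = ['A', 'B', 'A', 'C', 'B']  # assumed toy data
    pairs = window_cooccurrence(corpus, {'A', 'B'}, window=1)
    print(pairs)
    ```

    The resulting counts are symmetric by construction, so they can be poured into a square matrix the same way the answer's code fills `co_matrix`.
    
    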
    