How do I calculate a word-word co-occurrence matrix with sklearn?

南旧 2020-12-01 03:06

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.

I can get the document-term matrix, but I am not sure how to go about obtaining a word-word co-occurrence matrix.

6 answers
  • 2020-12-01 03:37

    @titipata I think your solution is not a good metric, because it gives the same weight to real co-occurrences and to occurrences that are just spurious. For example, suppose I have 5 texts and the words apple and house appear with these frequencies:

    text1: apple:10, "house":1

    text2: apple:10, "house":0

    text3: apple:10, "house":0

    text4: apple:10, "house":0

    text5: apple:10, "house":0

    The co-occurrence we are going to measure is 10*1+10*0+10*0+10*0+10*0=10, but it is just spurious.

    And in other, more important cases, like the following:

    text1: apple:1, "banana":1

    text2: apple:1, "banana":1

    text3: apple:1, "banana":1

    text4: apple:1, "banana":1

    text5: apple:1, "banana":1

    we are going to get a co-occurrence of only 1*1+1*1+1*1+1*1+1*1=5, when in fact that co-occurrence is really important.

    @Guiem Bosch In this case co-occurrences are measured only when the two words are contiguous.

    I propose using something like the @titipata solution to compute the matrix:

    Xc = (Y.T * Y) # this is co-occurrence matrix in sparse csr format
    

    where, instead of X, you use a matrix Y with ones in the positions where X is greater than 0 and zeros everywhere else.

    Using this, in the first example we get a co-occurrence of 1*1+1*0+1*0+1*0+1*0=1, and in the second example a co-occurrence of 1*1+1*1+1*1+1*1+1*1=5, which is what we are really looking for.
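The arithmetic above can be checked with a small dependency-free sketch. It uses plain Python lists instead of a sparse matrix, and the helper name `cooccurrence` and its `binarize` flag are made up for illustration; they are not part of sklearn or of the answer above.

```python
# Per-text counts taken from the two examples above (5 texts each).
apple_a = [10, 10, 10, 10, 10]
house = [1, 0, 0, 0, 0]
apple_b = [1, 1, 1, 1, 1]
banana = [1, 1, 1, 1, 1]


def cooccurrence(col_a, col_b, binarize=False):
    """Dot product of two count columns; optionally clip counts to 0/1 first."""
    if binarize:
        col_a = [min(x, 1) for x in col_a]
        col_b = [min(x, 1) for x in col_b]
    return sum(x * y for x, y in zip(col_a, col_b))


raw_house = cooccurrence(apple_a, house)                  # uses X
bin_house = cooccurrence(apple_a, house, binarize=True)   # uses Y
raw_banana = cooccurrence(apple_b, banana)
bin_banana = cooccurrence(apple_b, banana, binarize=True)

print(raw_house, bin_house)    # 10 1
print(raw_banana, bin_banana)  # 5 5
```

With the binarized columns, apple/banana (5) now clearly outranks apple/house (1), whereas the raw counts rank them the other way round.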

  • 2020-12-01 03:40

    None of the other answers take a moving window into account, so I wrote my own function that builds the co-occurrence matrix by applying a moving window of a defined size. It takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:

    from collections import defaultdict
    
    import numpy as np
    import pandas as pd
    
    def co_occurrence(sentences, window_size):
        d = defaultdict(int)
        vocab = set()
        for text in sentences:
            # preprocessing (use tokenizer instead)
            text = text.lower().split()
            # iterate over the tokens of the sentence
            for i in range(len(text)):
                token = text[i]
                vocab.add(token)  # add to vocab
                next_token = text[i+1 : i+1+window_size]
                for t in next_token:
                    key = tuple( sorted([t, token]) )
                    d[key] += 1
    
        # formulate the dictionary into dataframe
        vocab = sorted(vocab) # sort vocab
        df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                          index=vocab,
                          columns=vocab)
        for key, value in d.items():
            df.at[key[0], key[1]] = value
            df.at[key[1], key[0]] = value
        return df
    

    Let's try it out given the following two simple sentences:

    >>> text = ["I go to school every day by bus .",
                "i go to theatre every night by bus"]
    >>> 
    >>> df = co_occurrence(text, 2)
    >>> df
             .  bus  by  day  every  go  i  night  school  theatre  to
    .        0    1   1    0      0   0  0      0       0        0   0
    bus      1    0   2    1      0   0  0      1       0        0   0
    by       1    2   0    1      2   0  0      1       0        0   0
    day      0    1   1    0      1   0  0      0       1        0   0
    every    0    0   2    1      0   0  0      1       1        1   2
    go       0    0   0    0      0   0  2      0       1        1   2
    i        0    0   0    0      0   2  0      0       0        0   2
    night    0    1   1    0      1   0  0      0       0        1   0
    school   0    0   0    1      1   1  0      0       0        0   1
    theatre  0    0   0    0      1   1  0      1       0        0   1
    to       0    0   0    0      2   2  2      0       1        1   0
    
    [11 rows x 11 columns]
    

    Now, we have our co-occurrence matrix.
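If pandas is not available, or the vocabulary is too large for a dense table, the same forward-window counting can be kept in a dictionary. This is only a sketch under that assumption: the name `window_pair_counts` is hypothetical, and the tokenizer is the same naive lowercase/whitespace split used above.

```python
from collections import Counter


def window_pair_counts(sentences, window_size):
    """Count unordered token pairs that are at most window_size positions apart."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer
        for i, token in enumerate(tokens):
            # pair the current token with the next window_size tokens
            for other in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((token, other)))] += 1
    return counts


text = ["I go to school every day by bus .",
        "i go to theatre every night by bus"]
counts = window_pair_counts(text, 2)

print(counts[("bus", "by")])     # 2
print(counts[("go", "i")])       # 2
print(counts[(".", "bus")])      # 1
print(counts[("day", "night")])  # 0, never within the same window
```

The keys are sorted pairs, so each unordered pair is stored once; these values match the corresponding cells of the table above.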

  • 2020-12-01 03:44

    You can use the ngram_range parameter in the CountVectorizer or TfidfVectorizer

    Code example:

    bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # (2, 2) means you only want pairs of 2 adjacent words
    

    If you want to state explicitly which word co-occurrences to count, use the vocabulary param, e.g. vocabulary = {'awesome unicorns':0, 'batman forever':1}

    http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

    Here is self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this case we are tracking the co-occurrences of awesome unicorns and batman forever:

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np
    samples = ['awesome unicorns are awesome','batman forever and ever','I love batman forever']
    bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary = {'awesome unicorns':0, 'batman forever':1}) 
    co_occurrences = bigram_vectorizer.fit_transform(samples)
    print('Printing sparse matrix:', co_occurrences)
    print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense())
    sum_occ = np.sum(co_occurrences.todense(), axis=0)
    print('Sum of word-word occurrences:', sum_occ)
    # get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
    print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(), np.array(sum_occ)[0].tolist())))
    

    Final output is ('awesome unicorns', 1), ('batman forever', 2), which corresponds exactly to the sample data we provided.
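To make explicit what a bigram count measures, here is a dependency-free sketch that counts adjacent, ordered word pairs by hand. It is an approximation of the idea, not of CountVectorizer itself: it splits on whitespace, whereas CountVectorizer's default tokenizer also drops single-character tokens such as "I".

```python
from collections import Counter

samples = ['awesome unicorns are awesome',
           'batman forever and ever',
           'I love batman forever']

bigram_counts = Counter()
for sample in samples:
    tokens = sample.lower().split()
    # zip(tokens, tokens[1:]) yields each adjacent pair, in reading order
    bigram_counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))

print(bigram_counts["awesome unicorns"])  # 1
print(bigram_counts["batman forever"])    # 2
print(bigram_counts["unicorns awesome"])  # 0, the order of the pair matters
```

Only contiguous pairs are counted, so two words that share a sentence but are not neighbours contribute nothing.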

  • 2020-12-01 03:50

    With numpy, where the corpus is a list of lists (each inner list is a tokenized document):

    corpus = [['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'], 
              ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']]
    

    and the function returns the matrix together with a word -> row/column index mapping:

    import numpy as np
    
    def compute_co_occurrence_matrix(corpus, window_size):
    
        words = sorted(list(set([word for words_list in corpus for word in words_list])))
        num_words = len(words)
    
        M = np.zeros((num_words, num_words))
        word2Ind = dict(zip(words, range(num_words)))
    
        for doc in corpus:
    
            cur_idx = 0
            doc_len = len(doc)
    
            while cur_idx < doc_len:
    
                left = max(cur_idx-window_size, 0)
                right = min(cur_idx+window_size+1, doc_len)
                words_to_add = doc[left:cur_idx] + doc[cur_idx+1:right]
                focus_word = doc[cur_idx]
    
                for word in words_to_add:
                    outside_idx = word2Ind[word]
                    M[outside_idx, word2Ind[focus_word]] += 1
    
                cur_idx += 1
    
        return M, word2Ind
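For a quick sanity check of the symmetric window without numpy, the same counting can be sketched with nested counters. The name `window_counts` is made up for this illustration; `counts[a][b]` plays the role of `M[word2Ind[a], word2Ind[b]]`.

```python
from collections import Counter, defaultdict


def window_counts(corpus, window_size):
    """counts[focus][context]: how often context lies within window_size of focus."""
    counts = defaultdict(Counter)
    for doc in corpus:
        for i, focus in enumerate(doc):
            left = max(i - window_size, 0)
            # tokens to the left and to the right of the focus word
            context = doc[left:i] + doc[i + 1 : i + 1 + window_size]
            counts[focus].update(context)
    return counts


corpus = [['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'],
          ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']]
counts = window_counts(corpus, window_size=1)

print(counts['that']['glitters'])  # 1
print(counts['that']['well'])      # 1
print(counts['well']['that'])      # 1, the window is symmetric
print(counts['well']['<END>'])     # 1
```

Because the window extends equally to both sides, the counts are symmetric: swapping focus and context gives the same number.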
    
  • 2020-12-01 03:56

    I used the code below to create a co-occurrence matrix with a window size:

    #https://stackoverflow.com/questions/4843158/check-if-a-python-list-item-contains-a-string-inside-another-string
    import pandas as pd
    def co_occurance_matrix(input_text,top_words,window_size):
        co_occur = pd.DataFrame(index=top_words, columns=top_words)
    
        for row,nrow in zip(top_words,range(len(top_words))):
            for colm,ncolm in zip(top_words,range(len(top_words))):        
                count = 0
                if row == colm: 
                    co_occur.iloc[nrow,ncolm] = count
                else: 
                    for single_essay in input_text:
                        essay_split = single_essay.split(" ")
                        max_len = len(essay_split)
                        top_word_index = [index for index, split in enumerate(essay_split) if row in split]
                        for index in top_word_index:
                            if index == 0:
                                count = count + essay_split[:window_size + 1].count(colm)
                            elif index == (max_len -1): 
                                count = count + essay_split[-(window_size + 1):].count(colm)
                            else:
                                count = count + essay_split[index + 1 : (index + window_size + 1)].count(colm)
                                if index < window_size: 
                                    count = count + essay_split[: index].count(colm)
                                else:
                                    count = count + essay_split[(index - window_size): index].count(colm)
                    co_occur.iloc[nrow,ncolm] = count
    
        return co_occur
    

    Then I used the code below to test it:

    corpus = ['ABC DEF IJK PQR','PQR KLM OPQ','LMN PQR XYZ ABC DEF PQR ABC']
    words = ['ABC','PQR','DEF']
    window_size =2 
    
    result = co_occurance_matrix(corpus,words,window_size)
    result
    

    Output is here:

  • 2020-12-01 03:57

    Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.

    from sklearn.feature_extraction.text import CountVectorizer
    docs = ['this this this book',
            'this cat good',
            'cat good shit']
    count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
    X = count_model.fit_transform(docs)
    # X[X > 0] = 1 # run this line if you don't want extra within-text co-occurrence (see below)
    Xc = (X.T * X) # this is the co-occurrence matrix in sparse csr format
    Xc.setdiag(0) # optionally set the co-occurrence of each word with itself to 0
    print(Xc.todense()) # print out the matrix in dense format
    

    You can also look up the word-to-column dictionary in count_model:

    count_model.vocabulary_
    

    Or, if you want to normalize by the diagonal component (as in the answer in the previous post):

    import scipy.sparse as sp
    Xc = (X.T * X)
    g = sp.diags(1./Xc.diagonal())
    Xc_norm = g * Xc # normalized co-occurrence matrix
    

    A note on @Federico Caccia's answer: if you don't want spurious co-occurrences that come from a word being repeated within a single text, clip every count greater than 1 down to 1, e.g.

    X[X > 0] = 1 # do this line first before computing cooccurrence
    Xc = (X.T * X)
    ...
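As a rough illustration of what the binarized, diagonally normalized matrix contains, here is a plain-Python sketch on the same three documents. It assumes the binarized case, where the diagonal entry of a word is the number of documents containing it; the helper names `doc_cooccurrence` and `normalized` are hypothetical.

```python
docs = ['this this this book',
        'this cat good',
        'cat good shit']

# Binarized document-term rows: the set of distinct words in each document.
doc_words = [set(doc.split()) for doc in docs]


def doc_cooccurrence(w1, w2):
    """Entry (w1, w2) of the binarized X.T * X: documents containing both words."""
    return sum(1 for words in doc_words if w1 in words and w2 in words)


def normalized(w1, w2):
    """Row-normalized entry: divide by the diagonal entry of row w1."""
    return doc_cooccurrence(w1, w2) / doc_cooccurrence(w1, w1)


print(doc_cooccurrence('this', 'book'))  # 1, not 3, because counts are clipped
print(doc_cooccurrence('cat', 'good'))   # 2
print(normalized('this', 'book'))        # 0.5
print(normalized('book', 'this'))        # 1.0
```

Note that the normalized matrix is no longer symmetric: each row is divided by its own diagonal entry, so in the binarized case an entry reads as the fraction of documents containing the row word that also contain the column word.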
    