I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.
I can get the document-term matrix, but I am not sure how to go about obtaining a word-word matrix of co-occurrences.
@titipata I think your solution is not a good metric, because it gives the same weight to real co-occurrences and to occurrences that are merely spurious. For example, suppose I have 5 texts in which the words apple and house appear with these frequencies:
text1: apple:10, "house":1
text2: apple:10, "house":0
text3: apple:10, "house":0
text4: apple:10, "house":0
text5: apple:10, "house":0
The co-occurrence we measure is 10*1+10*0+10*0+10*0+10*0=10, but it is just spurious.
And in other, more important cases, like the following:
text1: apple:1, "banana":1
text2: apple:1, "banana":1
text3: apple:1, "banana":1
text4: apple:1, "banana":1
text5: apple:1, "banana":1
we get a co-occurrence of only 1*1+1*1+1*1+1*1+1*1=5, when in fact that co-occurrence is really important.
@Guiem Bosch In this case co-occurrences are measured only when the two words are contiguous.
I propose to use something like @titipata's solution to compute the matrix:
Xc = (Y.T * Y) # this is co-occurrence matrix in sparse csr format
where, instead of X, you use a matrix Y with ones in the positions where X is greater than 0 and zeros everywhere else.
With this, the first example gives co-occurrence: 1*1+1*0+1*0+1*0+1*0=1 and the second example gives co-occurrence: 1*1+1*1+1*1+1*1+1*1=5, which is what we are really looking for.
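The difference between the two weightings can be checked on the two toy examples above. This is only a minimal sketch with dense NumPy arrays; the names X1, X2 and the helper pair_count are made up for illustration, and the original snippet works on a sparse matrix instead:

```python
import numpy as np

# Rows are the 5 texts, columns are the two words being compared.
X1 = np.array([[10, 1], [10, 0], [10, 0], [10, 0], [10, 0]])  # apple, house
X2 = np.array([[1, 1]] * 5)                                   # apple, banana

def pair_count(X, binarize):
    """Co-occurrence of column 0 with column 1 (hypothetical helper)."""
    # Y has ones where X > 0 and zeros elsewhere when binarize is True.
    Y = (X > 0).astype(int) if binarize else X
    return int((Y.T @ Y)[0, 1])

raw = (pair_count(X1, False), pair_count(X2, False))    # (10, 5)
binary = (pair_count(X1, True), pair_count(X2, True))   # (1, 5)
print(raw, binary)
```

With raw counts the spurious apple/house pair outweighs apple/banana; after binarizing, the score is simply the number of texts that contain both words.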
None of the provided answers takes the moving-window concept into consideration. So I wrote my own function that finds the co-occurrence matrix by applying a moving window of a defined size. This function takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:
from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # the next window_size tokens to the right of the current one
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # formulate the dictionary into dataframe
    vocab = sorted(vocab)  # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
Let's try it out given the following two simple sentences:
>>> text = ["I go to school every day by bus .",
...         "i go to theatre every night by bus"]
>>>
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]
Now, we have our co-occurrence matrix.
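If you only need to look up a few pairs, the DataFrame can be skipped. Here is a rough stdlib-only sketch of the same forward-window counting; pair_counts is a name I made up, and whitespace splitting stands in for a real tokenizer:

```python
from collections import Counter

def pair_counts(sentences, window_size):
    """Count unordered token pairs that fall within window_size positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenizer
        for i, token in enumerate(tokens):
            # pair the current token with the next window_size tokens
            for other in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((token, other)))] += 1
    return counts

text = ["I go to school every day by bus .",
        "i go to theatre every night by bus"]
counts = pair_counts(text, 2)
print(counts[("bus", "by")])   # 2, same as the bus/by cell in the table
print(counts[("bus", "go")])   # 0, the words are never within 2 positions
```

Keys are sorted tuples, so ("bus", "by") and ("by", "bus") refer to the same entry as long as you sort before looking up.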
You can use the ngram_range parameter in the CountVectorizer or TfidfVectorizer.
Code example:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # (2, 2) means you only want pairs of 2 words
In case you want to explicitly say which co-occurrences of words you want to count, use the vocabulary param, i.e.: vocabulary = {'awesome unicorns':0, 'batman forever':1}
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this case we are tracking co-occurrences of awesome unicorns and batman forever:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
samples = ['awesome unicorns are awesome','batman forever and ever','I love batman forever']
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary = {'awesome unicorns':0, 'batman forever':1})
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense())
sum_occ = np.sum(co_occurrences.todense(),axis=0)
print('Sum of word-word occurrences:', sum_occ)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(), np.array(sum_occ)[0].tolist())))
Final output is ('awesome unicorns', 1), ('batman forever', 2), which corresponds exactly to the data provided in samples.
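As a cross-check that does not depend on scikit-learn, the same two counts can be reproduced by pairing each token with its right-hand neighbour. This is only a sketch: the plain whitespace tokenizer is my assumption and differs from CountVectorizer's default tokenization, and the names tracked and bigrams are made up:

```python
from collections import Counter

samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']
tracked = ['awesome unicorns', 'batman forever']

bigrams = Counter()
for sample in samples:
    tokens = sample.lower().split()  # assumption: plain whitespace tokenizer
    # pair each token with its right-hand neighbour
    bigrams.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))

result = [(phrase, bigrams[phrase]) for phrase in tracked]
print(result)  # [('awesome unicorns', 1), ('batman forever', 2)]
```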
With numpy, where the corpus is a list of lists (each list a tokenized document):
corpus = [['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'],
          ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']]
The function below returns the co-occurrence matrix together with a word -> row/col mapping:
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size):
    # sorted vocabulary over all documents
    words = sorted(list(set([word for words_list in corpus for word in words_list])))
    num_words = len(words)
    M = np.zeros((num_words, num_words))
    word2Ind = dict(zip(words, range(num_words)))
    for doc in corpus:
        cur_idx = 0
        doc_len = len(doc)
        while cur_idx < doc_len:
            # window boundaries, clipped to the document
            left = max(cur_idx-window_size, 0)
            right = min(cur_idx+window_size+1, doc_len)
            # context words on both sides, excluding the focus word itself
            words_to_add = doc[left:cur_idx] + doc[cur_idx+1:right]
            focus_word = doc[cur_idx]
            for word in words_to_add:
                outside_idx = word2Ind[word]
                M[outside_idx, word2Ind[focus_word]] += 1
            cur_idx += 1
    return M, word2Ind
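A quick sanity check of the windowed counting on the corpus above. The function is restated in condensed form so the snippet runs on its own; the name cooc_matrix and the window size of 1 are my choices for illustration:

```python
import numpy as np

def cooc_matrix(corpus, window_size):
    """Condensed restatement of the symmetric-window counting (illustrative)."""
    words = sorted({w for doc in corpus for w in doc})
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for doc in corpus:
        for i, focus in enumerate(doc):
            left = max(i - window_size, 0)
            # context on both sides of the focus word; slicing clips at the end
            context = doc[left:i] + doc[i + 1 : i + window_size + 1]
            for w in context:
                M[word2ind[w], word2ind[focus]] += 1
    return M, word2ind

corpus = [['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'],
          ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']]
M, word2ind = cooc_matrix(corpus, window_size=1)

print(M.shape)                                  # (10, 10)
print(M[word2ind['well'], word2ind['ends']])    # 1.0
print(np.array_equal(M, M.T))                   # True: both directions are counted
```

Because every word acts as the focus word once, each pair inside a window is counted in both directions, which is why the matrix comes out symmetric.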
I used the code below to create a co-occurrence matrix with a window size:
#https://stackoverflow.com/questions/4843158/check-if-a-python-list-item-contains-a-string-inside-another-string
import pandas as pd
def co_occurance_matrix(input_text,top_words,window_size):
    co_occur = pd.DataFrame(index=top_words, columns=top_words)

    for row,nrow in zip(top_words,range(len(top_words))):
        for colm,ncolm in zip(top_words,range(len(top_words))):
            count = 0
            if row == colm:
                co_occur.iloc[nrow,ncolm] = count
            else:
                for single_essay in input_text:
                    essay_split = single_essay.split(" ")
                    max_len = len(essay_split)
                    # positions of row in the text (note: substring match, see link above)
                    top_word_index = [index for index, split in enumerate(essay_split) if row in split]
                    for index in top_word_index:
                        if index == 0:
                            count = count + essay_split[:window_size + 1].count(colm)
                        elif index == (max_len -1):
                            count = count + essay_split[-(window_size + 1):].count(colm)
                        else:
                            # words to the right of the match
                            count = count + essay_split[index + 1 : (index + window_size + 1)].count(colm)
                            # words to the left of the match
                            if index < window_size:
                                count = count + essay_split[: index].count(colm)
                            else:
                                count = count + essay_split[(index - window_size): index].count(colm)
                co_occur.iloc[nrow,ncolm] = count

    return co_occur
Then I used the code below to test it:
corpus = ['ABC DEF IJK PQR','PQR KLM OPQ','LMN PQR XYZ ABC DEF PQR ABC']
words = ['ABC','PQR','DEF']
window_size = 2
result = co_occurance_matrix(corpus,words,window_size)
result
Here is my example solution using CountVectorizer in scikit-learn. Referring to this post, you can simply use matrix multiplication to get the word-word co-occurrence matrix.
from sklearn.feature_extraction.text import CountVectorizer
docs = ['this this this book',
'this cat good',
'cat good shit']
count_model = CountVectorizer(ngram_range=(1,1)) # default unigram model
X = count_model.fit_transform(docs)
# X[X > 0] = 1 # run this line if you don't want extra within-text co-occurrence (see below)
Xc = (X.T * X) # this is the co-occurrence matrix in sparse csr format
Xc.setdiag(0) # sometimes you want to set the co-occurrence of a word with itself to 0
print(Xc.todense()) # print out the matrix in dense format
You can also refer to the dictionary of words in count_model, count_model.vocabulary_
Or, if you want to normalize by the diagonal component (as referred to in the answer in the previous post):
import scipy.sparse as sp
Xc = (X.T * X)
g = sp.diags(1./Xc.diagonal())
Xc_norm = g * Xc # normalized co-occurrence matrix
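To see what the normalization does, here is a dense NumPy sketch on the document-term counts of the three example docs. The counts are typed in by hand, with columns in the alphabetical order I would expect CountVectorizer to use; dividing row i by Xc[i, i] is the dense equivalent of g * Xc:

```python
import numpy as np

# columns: book, cat, good, shit, this (hand-entered counts, an assumption
# that mirrors what CountVectorizer produces for the docs above)
X = np.array([[1, 0, 0, 0, 3],
              [0, 1, 1, 0, 1],
              [0, 1, 1, 1, 0]])
Xc = X.T @ X
Xc_norm = Xc / Xc.diagonal()[:, None]  # scale row i by 1 / Xc[i, i]

book, cat, good, this = 0, 1, 2, 4
print(Xc[cat, good], Xc_norm[cat, good])   # 2 1.0
print(Xc_norm[this, book])                 # 0.3
print(Xc_norm[book, this])                 # 3.0
```

Note that the normalized matrix is no longer symmetric: each row is scaled by that word's own diagonal entry, so the this/book value differs from the book/this value.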
As an addition to @Federico Caccia's answer: if you don't want co-occurrences that are spurious within a single text, set every occurrence count greater than 1 to 1, e.g.
X[X > 0] = 1 # do this line first before computing cooccurrence
Xc = (X.T * X)
...