I'm working on a corpus of ~100k research papers. I'm considering three fields:
I used the TfIdfVec
I'm afraid the matrix might be too large: it would have 96582*96582=9328082724 cells. Try running the same step on a smaller slice of titles_tfidf and check whether it works.
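A rough way to check this, as a minimal sketch: it assumes the large matrix is a dense float64 pairwise cosine-similarity matrix of titles_tfidf against itself, which the question does not confirm. The small random matrix is only a stand-in for the real titles_tfidf, and the slice size of 200 is arbitrary.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Size of a full 96582 x 96582 result, assuming dense float64 (8 bytes per cell).
n_docs = 96582
cells = n_docs * n_docs
dense_gib = cells * 8 / 2**30
print(cells, "cells, roughly", round(dense_gib, 1), "GiB if stored densely")

# Stand-in for the real titles_tfidf (assumption: a sparse CSR matrix).
rng = np.random.default_rng(0)
dense = rng.random((1000, 50))
dense[dense < 0.95] = 0.0          # keep about 5% of the entries
titles_tfidf = sparse.csr_matrix(dense)

# Compare only a slice of rows against the whole matrix.
subset = titles_tfidf[:200]
sims = cosine_similarity(subset, titles_tfidf)
print(sims.shape)
```

If the slice works, the full computation can be done block by block in the same way, keeping only the part of each block that is actually needed.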
Source: http://scipy-user.10969.n7.nabble.com/SciPy-User-strange-error-when-creating-csr-matrix-td20129.html
EDIT: If you are using an older SciPy/NumPy version, you might want to update: https://github.com/scipy/scipy/pull/4678
EDIT2: Also, if you are using 32-bit Python, switching to 64-bit might help (I suppose).
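Both points can be checked from inside the interpreter. This is just a small sketch using the standard library plus the version attributes of NumPy and SciPy:

```python
import struct
import sys

import numpy
import scipy

# Pointer size in bits: 32 on a 32-bit Python build, 64 on a 64-bit one.
bits = struct.calcsize("P") * 8
is_64bit = sys.maxsize > 2**32

print("Python build:", bits, "bit")
print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
```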
EDIT3:
To answer your original question: when you reuse the vocabulary from plaintexts and titles contains new words, those words are ignored and do not influence the tf-idf values. Hopefully this snippet makes it clearer:
from sklearn.feature_extraction.text import TfidfVectorizer

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

# Learn the vocabulary from plaintexts only
vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_

# Reuse that vocabulary for titles: words outside it are ignored
vectorizer = TfidfVectorizer(vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
print('values using vocabulary')
print(titles_tfidf)
# get_feature_names() was removed in scikit-learn 1.2
print(vectorizer.get_feature_names_out().tolist())

# For comparison: a vectorizer that learns its vocabulary from titles
print('Brand new vectorizer')
vectorizer = TfidfVectorizer()
titles_tfidf = vectorizer.fit_transform(titles)
print(titles_tfidf)
print(vectorizer.get_feature_names_out().tolist())
The result is:
values using vocabulary
(0, 2) 1.0
(3, 3) 0.78528827571
(3, 2) 0.61913029649
['amoersand', 'are', 'here', 'plain', 'texts', 'they']
Brand new vectorizer
(0, 0) 0.78528827571
(0, 4) 0.61913029649
(1, 6) 1.0
(2, 7) 0.57735026919
(2, 2) 0.57735026919
(2, 3) 0.57735026919
(3, 4) 0.486934264074
(3, 1) 0.617614370976
(3, 5) 0.617614370976
['and', 'but', 'dog', 'eagle', 'here', 'plain', 'titles', 'wolf']
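Notice that the values obtained with the fixed vocabulary are not the same as the brand new vectorizer's values with the words that do not occur in plaintexts removed: each row is normalized over the terms that remain, so row 3 gives 0.785/0.619 in the first case but 0.618/0.487 in the second.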
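The difference can be checked numerically. The sketch below reuses the toy data from the snippet and assumes the default TfidfVectorizer settings (L2 row normalization, smoothed idf); names such as shared, dropped and renormalized are only illustrative. It drops the out-of-vocabulary columns from the fresh matrix, compares that with the fixed-vocabulary result, and then compares again after re-normalizing the rows.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

# tf-idf of titles restricted to the vocabulary learned from plaintexts
vocab = TfidfVectorizer().fit(plaintexts).vocabulary_
restricted = TfidfVectorizer(vocabulary=vocab).fit_transform(titles).toarray()

# tf-idf of titles with a vocabulary learned from titles themselves
fresh_vectorizer = TfidfVectorizer()
fresh = fresh_vectorizer.fit_transform(titles).toarray()

# Words present in both vocabularies, and their column indices in each matrix
shared = [w for w in sorted(vocab) if w in fresh_vectorizer.vocabulary_]
fresh_cols = [fresh_vectorizer.vocabulary_[w] for w in shared]
restricted_cols = [vocab[w] for w in shared]

# Dropping the extra columns alone leaves the old row norms in place
dropped = fresh[:, fresh_cols]
same_after_drop = np.allclose(dropped, restricted[:, restricted_cols])

# Re-normalizing each row to unit L2 length recovers the restricted values
renormalized = normalize(dropped)
same_after_renorm = np.allclose(renormalized, restricted[:, restricted_cols])

print(shared)
print(same_after_drop, same_after_renorm)
```

With this toy data the first comparison is expected to fail and the second to succeed, because both vectorizers see the same documents and therefore compute the same idf for the shared words; only the row normalization differs.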