TfidfVectorizer: How does the vectorizer with a fixed vocabulary deal with new words?

星月不相逢 2021-01-05 12:03

I'm working on a corpus of ~100k research papers. I'm considering three fields:

  1. plaintext
  2. title
  3. abstract

I used the TfIdfVec

1 answer
  •  北荒 (OP)
    2021-01-05 12:27

    I'm afraid the matrix might be too large. It would have 96582 * 96582 = 9328082724 cells. Try running the same step on a slice of titles_tfidf and check whether it still fails.
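
To put that cell count in perspective, here is a rough back-of-the-envelope sketch. It assumes the result is stored densely as 8-byte floats, which is an assumption and not something stated in the question; it also checks the count against the signed 32-bit integer limit, which may be relevant to the index overflow discussed in the links below.

```python
n_docs = 96582
cells = n_docs * n_docs

# Assumption: a dense float64 matrix, i.e. 8 bytes per cell.
bytes_per_cell = 8
total_bytes = cells * bytes_per_cell

# Largest value a signed 32-bit integer can hold.
int32_max = 2**31 - 1
exceeds_int32 = cells > int32_max

print(cells)
print(f"{total_bytes / 2**30:.1f} GiB")
print(exceeds_int32)
```

Under that assumption the full matrix would need tens of GiB, so checking a slice first is the cheaper experiment.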

    Source: http://scipy-user.10969.n7.nabble.com/SciPy-User-strange-error-when-creating-csr-matrix-td20129.html

    EDIT: If you are using an older SciPy/NumPy version, you might want to update: https://github.com/scipy/scipy/pull/4678

    EDIT 2: Also, if you are using 32-bit Python, switching to 64-bit might help (I suppose).

    EDIT 3: To answer your original question: when you reuse the vocabulary from plaintexts and titles contain new words, those words are ignored. They get no column in the matrix and do not influence the tf-idf values of the words that are in the vocabulary. Hopefully this snippet makes it clearer:

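    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    plaintexts = ["They are", "plain texts texts amoersand here"]
    titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]
    
    # Learn the vocabulary from the plaintexts.
    vectorizer = TfidfVectorizer()
    plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
    vocab = vectorizer.vocabulary_
    # Reuse that vocabulary for the titles: words outside it are dropped,
    # while the idf weights are re-estimated from the titles.
    vectorizer = TfidfVectorizer(vocabulary=vocab)
    titles_tfidf = vectorizer.fit_transform(titles)
    print('values using vocabulary')
    print(titles_tfidf)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement.
    print(vectorizer.get_feature_names_out().tolist())
    # For comparison: a vectorizer that builds its vocabulary from the titles.
    print('Brand new vectorizer')
    vectorizer = TfidfVectorizer()
    titles_tfidf = vectorizer.fit_transform(titles)
    print(titles_tfidf)
    print(vectorizer.get_feature_names_out().tolist())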
    

    Result is:

    values using vocabulary
      (0, 2)        1.0
      (3, 3)        0.78528827571
      (3, 2)        0.61913029649
    ['amoersand', 'are', 'here', 'plain', 'texts', 'they']
    Brand new vectorizer
      (0, 0)        0.78528827571
      (0, 4)        0.61913029649
      (1, 6)        1.0
      (2, 7)        0.57735026919
      (2, 2)        0.57735026919
      (2, 3)        0.57735026919
      (3, 4)        0.486934264074
      (3, 1)        0.617614370976
      (3, 5)        0.617614370976
    ['and', 'but', 'dog', 'eagle', 'here', 'plain', 'titles', 'wolf']
    

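    Note that this is not the same as first removing from titles the words that do not occur in plaintexts: with the fixed vocabulary the matrix keeps one column per plaintexts word, even for words such as 'texts' that never appear in any title.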
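
To show where the numbers in the fixed-vocabulary output come from, here is a minimal pure-Python sketch of the same computation. It assumes scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation per row, with lowercased tokens of two or more word characters). The helper names `tokenize` and `tfidf_fixed_vocab` are made up for this illustration and are not part of any library.

```python
import math
import re


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring TfidfVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_fixed_vocab(docs, vocab):
    """Return one {term: weight} dict per document, restricted to vocab."""
    n_docs = len(docs)
    # Count only in-vocabulary tokens; new words are dropped at this point.
    counts = []
    for doc in docs:
        row = {}
        for tok in tokenize(doc):
            if tok in vocab:
                row[tok] = row.get(tok, 0) + 1
        counts.append(row)
    # Document frequency and smoothed idf, estimated from docs.
    df = {term: sum(1 for row in counts if term in row) for term in vocab}
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for row in counts:
        weights = {t: c * idf[t] for t, c in row.items()}
        # L2-normalise each row; rows with no known words stay empty.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

vocab = sorted({tok for doc in plaintexts for tok in tokenize(doc)})
weights = tfidf_fixed_vocab(titles, vocab)
for i, row in enumerate(weights):
    print(i, row)
```

The rows for "titles " and "wolf dog eagle" come out empty because none of their words are in the vocabulary, and the last row should carry the same 'plain' and 'here' weights as the (3, 3) and (3, 2) entries in the result above.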
