sklearn: How to speed up a vectorizer (e.g. TfidfVectorizer)

眼角桃花 2021-01-02 09:22

After thoroughly profiling my program, I have been able to pinpoint that it is being slowed down by the vectorizer.

I am working on text data, and two lines of simpl

1 Answer
  • 2021-01-02 09:55

    Unsurprisingly, it's NLTK that is slow:

    >>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
    >>> %timeit tfidf.fit_transform(X_train)
    1 loops, best of 3: 4.89 s per loop
    >>> tfidf = TfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
    >>> %timeit tfidf.fit_transform(X_train)
    1 loops, best of 3: 415 ms per loop
    

    You can speed this up by using a smarter implementation of the Snowball stemmer, e.g., PyStemmer:

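    >>> import Stemmer
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> english_stemmer = Stemmer.Stemmer('en')
    >>> class StemmedTfidfVectorizer(TfidfVectorizer):
    ...     def build_analyzer(self):
    ...         # Reuse the stock analyzer (tokenizing, stop words, n-grams), then stem its tokens.
    ...         analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
    ...         return lambda doc: english_stemmer.stemWords(analyzer(doc))
    ...     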
    >>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
    >>> %timeit tfidf.fit_transform(X_train)
    1 loops, best of 3: 650 ms per loop
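
    If you are stuck with a slow per-word stemmer, another option is to memoize it, since the same tokens recur across documents. This is a different technique from swapping in PyStemmer, and only a rough sketch: `toy_stem` and `make_stemmed_analyzer` are made-up names for illustration, and `toy_stem` is a stand-in, not Snowball. Whether caching helps in your case is something to measure; it is likely to matter less for a stemmer that already caches internally.

```python
from functools import lru_cache


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common English suffixes.

    This is NOT Snowball; pass a real stemmer's per-word function instead.
    """
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def make_stemmed_analyzer(analyzer, stem, cache_size=100000):
    """Compose an analyzer (doc -> tokens) with a memoized per-word stemmer."""
    cached_stem = lru_cache(maxsize=cache_size)(stem)

    def stemmed(doc):
        # Each distinct token is stemmed once; repeats are served from the cache.
        return [cached_stem(token) for token in analyzer(doc)]

    # Expose cache statistics so the hit rate can be inspected.
    stemmed.cache_info = cached_stem.cache_info
    return stemmed


# str.split stands in for the analyzer a vectorizer would build.
analyze = make_stemmed_analyzer(str.split, toy_stem)
tokens = analyze("parsing parsed parses parsing")
print(tokens)
print(analyze.cache_info())
```

    In a build_analyzer override like the one above, you would return such a wrapper in place of the lambda, passing it the analyzer obtained from the parent class.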
    

    NLTK is a teaching toolkit. It's slow by design, because it's optimized for readability.
