Efficient way to create term density matrix from pandas DataFrame

自闭症患者 2021-02-06 07:15

I am trying to create a term density matrix from a pandas DataFrame, so I can rate terms appearing in the DataFrame. I also want to be able to keep the 'spatial' aspect of my

2 Answers
  • 2021-02-06 07:49

    herrfz provides a way to handle this, but I just wanted to point out that creating a term density data structure using a Python set is counterproductive, since a set is a collection of unique objects. You won't be able to capture the count of each word, only the presence of a word for a given row.

    return set(nltk.wordpunct_tokenize(strin)).difference(sw)
    

    In order to strip out the stopwords you could do something like

    tokens_stripped = [token for token in tokens 
                       if token not in stopwords]
    

    after tokenization.
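
    The point above can be sketched with the standard library alone: `collections.Counter` keeps per-word counts where a `set` would only record presence. The tokenizer and stopword list here are simplified stand-ins for the nltk-based ones in the question, not the original code:

    ```python
    from collections import Counter

    # Hypothetical stopword set and rows, standing in for nltk tokenization.
    stopwords = {"the", "a", "of"}
    rows = ["the egg of the egg", "a fried egg"]

    def term_counts(text, sw):
        # Lower-case whitespace tokenization; strip stopwords, keep counts.
        tokens = [t for t in text.lower().split() if t not in sw]
        return Counter(tokens)

    counts = [term_counts(r, stopwords) for r in rows]
    print(counts[0]["egg"])  # 2 -- a set would only have recorded presence
    ```

    Each `Counter` behaves like a dict of word frequencies, so the counts survive into whatever matrix you build from them.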

  • 2021-02-06 08:09

    You can use scikit-learn's CountVectorizer:

    In [14]: from sklearn.feature_extraction.text import CountVectorizer
    
    In [15]: countvec = CountVectorizer()
    
    In [16]: countvec.fit_transform(df.title)
    Out[16]: 
    <4x8 sparse matrix of type '<type 'numpy.int64'>'
        with 9 stored elements in Compressed Sparse Column format>
    

    It returns the term-document matrix in sparse representation because such a matrix is usually huge and, well, sparse.

    For your particular example, I guess converting it back to a DataFrame would still work:

    In [17]: pd.DataFrame(countvec.fit_transform(df.title).toarray(), columns=countvec.get_feature_names())
    Out[17]: 
       boiled  delicious  egg  else  fried  orange  something  split
    0       1          1    1     0      0       0          0      0
    1       0          0    1     0      1       0          0      0
    2       0          0    0     0      0       1          0      1
    3       0          0    0     1      0       0          1      0
    
    [4 rows x 8 columns]
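
    A self-contained version of this session is sketched below. The titles are an assumption reconstructed from the matrix above (the question's actual `df` is not shown), and `get_feature_names_out` replaces `get_feature_names`, which was removed in scikit-learn 1.2:

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # Assumed sample data, chosen to be consistent with the output above.
    df = pd.DataFrame({"title": ["delicious boiled egg", "fried egg",
                                 "split orange", "something else"]})

    countvec = CountVectorizer()
    dtm = countvec.fit_transform(df.title)  # sparse term-document matrix
    dense = pd.DataFrame(dtm.toarray(),
                         columns=countvec.get_feature_names_out())
    print(dense)  # 4 rows x 8 columns, one column per vocabulary term
    ```

    The dense conversion via `toarray()` is fine at this size, but for a large corpus you would keep working with the sparse matrix directly.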
    