Append tfidf to pandas dataframe

旧时难觅i 2020-12-16 01:27

I have the following pandas structure:

col1 col2 col3 text
1    1    0    meaningful text
5    9    7    trees
7    8    2    text

I'd lik

3 Answers
  •  醉梦人生
    2020-12-16 02:23

    You can try the following:

    import numpy as np 
    import pandas as pd 
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # create some data
    col1 = np.random.choice(10, size=10)
    col2 = np.random.choice(10, size=10)
    col3 = np.random.choice(10, size=10)
    text = ['Some models allow for specialized',
             'efficient parameter search strategies,',
             'outlined below. Two generic approaches',
             'to sampling search candidates are ',
             'provided in scikit-learn: for given values,',
             'GridSearchCV exhaustively considers all',
             'parameter combinations, while RandomizedSearchCV',
             'can sample a given number of candidates',
             ' from a parameter space with a specified distribution.',
             ' After describing these tools we detail best practice applicable to both approaches.']
    
    # create a dataframe from the created data; building it from a dict
    # sets the column names and keeps the numeric columns as integers
    df = pd.DataFrame({'col1': col1, 'col2': col2, 'col3': col3, 'text': text})
    
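    tfidf_vec = TfidfVectorizer()
    # toarray() returns a plain ndarray (todense() returns the legacy np.matrix)
    tfidf_dense = tfidf_vec.fit_transform(df['text']).toarray()
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    new_cols = tfidf_vec.get_feature_names_out()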
    
    # drop the 'text' column first: if the word 'text' is in the vocabulary,
    # join() raises a ValueError because the column names overlap
    df = df.drop('text',axis=1)
    # join the tfidf values to the existing dataframe
    df = df.join(pd.DataFrame(tfidf_dense, columns=new_cols))
    
