Append tfidf to pandas dataframe

旧时难觅i 2020-12-16 01:27

I have the following pandas structure:

col1 col2 col3 text
1    1    0    meaningful text
5    9    7    trees
7    8    2    text

I'd lik

3 answers
  • 2020-12-16 02:08

    I would like to add some information to the accepted answer.

    Before concatenating the two DataFrames (the main DataFrame and the TF-IDF DataFrame), make sure that their indices match. For instance, you can call df.reset_index(drop=True, inplace=True) on the main DataFrame so that it has the same default 0..n-1 index as the newly built TF-IDF DataFrame.

    Otherwise, pd.concat aligns the rows by index label and the concatenated DataFrame will contain many rows filled with NaN. Judging by the comments, this is probably what the OP experienced.
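
    A minimal sketch of that failure mode and the fix, using pandas only. The index labels 10, 20, 30 are made up for illustration (for example, what is left after filtering rows), and the TF-IDF values are copied from the example output in this thread rather than recomputed:

```python
import pandas as pd

# Main frame with a non-default index (hypothetical labels 10, 20, 30).
df = pd.DataFrame(
    {"col1": [1, 5, 7], "col2": [1, 9, 8], "col3": [0, 7, 2]},
    index=[10, 20, 30],
)

# TF-IDF frame as it comes out of pd.DataFrame(x.toarray(), ...):
# it always gets a fresh default index 0, 1, 2.
tfidf = pd.DataFrame(
    {
        "meaningful": [0.795961, 0.0, 0.0],
        "text": [0.605349, 0.0, 1.0],
        "trees": [0.0, 1.0, 0.0],
    }
)

# Labels do not overlap, so concat keeps the union of both indices:
# six rows, each half-filled with NaN.
misaligned = pd.concat([df, tfidf], axis=1)

# Resetting the main frame's index makes the rows line up one-to-one.
aligned = pd.concat([df.reset_index(drop=True), tfidf], axis=1)

print(misaligned)
print(aligned)
```

    If the original index labels matter, an alternative is to leave df alone and give the TF-IDF frame the same index instead (tfidf.index = df.index).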

  • 2020-12-16 02:23

    You can try the following -

    import numpy as np 
    import pandas as pd 
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # create some data
    col1 = np.asarray(np.random.choice(10,size=(10)))
    col2 = np.asarray(np.random.choice(10,size=(10)))
    col3 = np.asarray(np.random.choice(10,size=(10)))
    text = ['Some models allow for specialized',
             'efficient parameter search strategies,',
             'outlined below. Two generic approaches',
             'to sampling search candidates are ',
             'provided in scikit-learn: for given values,',
             'GridSearchCV exhaustively considers all',
             'parameter combinations, while RandomizedSearchCV',
             'can sample a given number of candidates',
             ' from a parameter space with a specified distribution.',
             ' After describing these tools we detail best practice applicable to both approaches.']
    
    # create a DataFrame from the generated data; building it from a dict keeps
    # the numeric columns numeric instead of casting every column to object
    df = pd.DataFrame({'col1': col1, 'col2': col2, 'col3': col3, 'text': text})
    
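    tfidf_vec = TfidfVectorizer()
    # toarray() returns a plain ndarray (todense() returns the legacy np.matrix type)
    tfidf_dense = tfidf_vec.fit_transform(df['text']).toarray()
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    new_cols = tfidf_vec.get_feature_names_out()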
    
    # drop the original 'text' column first: 'text' may also be one of the TF-IDF
    # feature names, and join() raises an error on overlapping column names
    df = df.drop('text',axis=1)
    # join the tfidf values to the existing dataframe
    df = df.join(pd.DataFrame(tfidf_dense, columns=new_cols))
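
    If the original text column has to stay, one option is to rename the TF-IDF columns before joining. The sketch below uses pandas only; the prefix "tfidf_" is an arbitrary choice, and the TF-IDF values are copied from the three-row example in this thread rather than recomputed:

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 5, 7], "text": ["meaningful text", "trees", "text"]}
)

# Stand-in for the TF-IDF frame; note that it has a feature named "text".
tfidf = pd.DataFrame(
    {
        "meaningful": [0.795961, 0.0, 0.0],
        "text": [0.605349, 0.0, 1.0],
        "trees": [0.0, 1.0, 0.0],
    }
)

# Joining directly fails because both frames have a column called "text".
try:
    df.join(tfidf)
    collided = False
except ValueError as err:
    collided = True
    print("join failed:", err)

# Prefixing the feature columns removes the clash and keeps the raw text.
res = df.join(tfidf.add_prefix("tfidf_"))
print(res.columns.tolist())
```

    A prefix also makes it easy to select all TF-IDF columns later, for example with res.filter(like="tfidf_").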
    
  • 2020-12-16 02:30

    You can proceed as follows:

    Load data into a dataframe:

    import pandas as pd
    
    df = pd.read_table("/tmp/test.csv", sep=r"\s+")
    print(df)
    

    Output:

       col1  col2  col3             text
    0     1     1     0  meaningful text
    1     5     9     7            trees
    2     7     8     2             text
    

    Vectorize the text column with sklearn.feature_extraction.text.TfidfVectorizer, which tokenizes each entry and returns a sparse matrix of TF-IDF weights:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    v = TfidfVectorizer()
    x = v.fit_transform(df['text'])
    

    Convert the sparse TF-IDF matrix into a DataFrame, using the vocabulary as column names:

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
    print(df1)
    

    Output:

       meaningful      text  trees
    0    0.795961  0.605349    0.0
    1    0.000000  0.000000    1.0
    2    0.000000  1.000000    0.0
    

    Concatenate the TF-IDF DataFrame to the original one:

    res = pd.concat([df, df1], axis=1)
    print(res)
    

    Output:

       col1  col2  col3             text  meaningful      text  trees
    0     1     1     0  meaningful text    0.795961  0.605349    0.0
    1     5     9     7            trees    0.000000  0.000000    1.0
    2     7     8     2             text    0.000000  1.000000    0.0
    

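    Note that the result above has two columns named text: the original strings and the TF-IDF weight of the word "text". If you want to drop the original text column, you need to do that before the concatenation: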

    df.drop('text', axis=1, inplace=True)
    res = pd.concat([df, df1], axis=1)
    print(res)
    

    Output:

       col1  col2  col3  meaningful      text  trees
    0     1     1     0    0.795961  0.605349    0.0
    1     5     9     7    0.000000  0.000000    1.0
    2     7     8     2    0.000000  1.000000    0.0
    

    Here's the full code:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    df = pd.read_table("/tmp/test.csv", sep=r"\s+")
    v = TfidfVectorizer()
    x = v.fit_transform(df['text'])
    
    df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
    df.drop('text', axis=1, inplace=True)
    res = pd.concat([df, df1], axis=1)
    