Combining bag of words and other features in one model using sklearn and pandas

名媛妹妹 2021-01-31 04:38

I am trying to model the score that a post receives, based on both the text of the post and other features (time of day, length of post, etc.).

I am wondering how to bes

1 answer
  • 2021-01-31 05:02

    You could do everything with your map and lambda:

    # list(...) is needed on Python 3, where map returns a lazy iterator
    tokenized = list(map(lambda msg, ft1, ft2: features([msg, ft1, ft2]), posts.message, posts.feature_1, posts.feature_2))
    

    This skips your intermediate temp step and iterates over the three columns in parallel.
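
As a minimal sketch of the map-and-lambda approach: the `features` function below is a hypothetical stand-in for the asker's own function (the original is not shown here), and the three-row DataFrame is illustrative data with the same column names. It assumes each row is turned into a dict that is then fed to DictVectorizer.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Illustrative data with the column names used in the answer
posts = pd.DataFrame({
    "message": ["This is the text", "This is more text", "More random text"],
    "feature_1": [4, 3, 3],
    "feature_2": [7, 2, 2],
})

def features(row):
    # Hypothetical stand-in for the asker's function: row is [message, ft1, ft2]
    msg, ft1, ft2 = row
    feats = {"word=" + tok: 1 for tok in msg.lower().split()}
    feats["feature_1"] = ft1
    feats["feature_2"] = ft2
    return feats

# map walks the three columns in parallel, one row at a time
tokenized = list(map(lambda msg, ft1, ft2: features([msg, ft1, ft2]),
                     posts.message, posts.feature_1, posts.feature_2))

# DictVectorizer turns the list of dicts into a sparse feature matrix
dict_vectorizer = DictVectorizer()
X_dict = dict_vectorizer.fit_transform(tokenized)
```

Each dict key becomes one column, so the numeric features sit next to the word indicators in the same matrix.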

    Another solution would be to convert the messages into a CountVectorizer sparse matrix and join that matrix with the feature values from the posts dataframe. This skips having to construct a dict and produces a sparse matrix similar to what you would get with DictVectorizer:

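    import numpy as np
    import pandas as pd
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import CountVectorizer

    posts = pd.read_csv('post.csv')

    # Bag-of-words vectorizer: binary indicators for unigrams and bigrams
    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
    y = posts["score"].values.astype(np.float32)

    # Stack the sparse text matrix next to the numeric feature columns
    X = sp.hstack((vectorizer.fit_transform(posts.message), posts[['feature_1', 'feature_2']].values), format='csr')
    # get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
    X_columns = list(vectorizer.get_feature_names_out()) + posts[['feature_1', 'feature_2']].columns.tolist()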
    
    
    posts
    Out[38]: 
       ID              message  feature_1  feature_2  score
    0   1   'This is the text'          4          7     10
    1   2  'This is more text'          3          2      9
    2   3   'More random text'          3          2      9
    
    X_columns
    Out[39]: 
    [u'is',
     u'is more',
     u'is the',
     u'more',
     u'more random',
     u'more text',
     u'random',
     u'random text',
     u'text',
     u'the',
     u'the text',
     u'this',
     u'this is',
     'feature_1',
     'feature_2']
    
    X.toarray()
    Out[40]: 
    array([[1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 4, 7],
           [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 3, 2],
           [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 3, 2]])
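
To show how the stacked matrix might be used for the original goal of modelling the score, here is a minimal sketch. The choice of Ridge regression and the `new_posts` row are assumptions made for illustration; any estimator that accepts sparse input could be swapped in. The key point is that new data goes through `transform`, not `fit_transform`, so the columns line up with the training matrix.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Same three example rows as in the answer, built inline instead of read from CSV
posts = pd.DataFrame({
    "message": ["This is the text", "This is more text", "More random text"],
    "feature_1": [4, 3, 3],
    "feature_2": [7, 2, 2],
    "score": [10, 9, 9],
})

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = sp.hstack((vectorizer.fit_transform(posts.message),
               posts[["feature_1", "feature_2"]].values), format="csr")
y = posts["score"].values.astype(np.float32)

# Ridge is an illustrative choice, not something the answer prescribes
model = Ridge(alpha=1.0)
model.fit(X, y)

# Hypothetical unseen post: reuse the fitted vocabulary with transform()
new_posts = pd.DataFrame({"message": ["Some more text"],
                          "feature_1": [2], "feature_2": [5]})
X_new = sp.hstack((vectorizer.transform(new_posts.message),
                   new_posts[["feature_1", "feature_2"]].values), format="csr")
pred = model.predict(X_new)
```

Words in a new post that were not seen during fitting ("some" here) are simply ignored by the vectorizer.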
    

    Additionally, sklearn-pandas provides DataFrameMapper, which also does what you're looking for:

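    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn_pandas import DataFrameMapper

    mapper = DataFrameMapper([
        (['feature_1', 'feature_2'], None),  # None passes these columns through unchanged
        ('message', CountVectorizer(binary=True, ngram_range=(1, 2)))
    ])
    X = mapper.fit_transform(posts)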
    
    X
    Out[71]: 
    array([[4, 7, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
           [3, 2, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1],
           [3, 2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]])
    

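    Note: X is not sparse when using this last method; by default DataFrameMapper returns a dense NumPy array. Passing sparse=True to DataFrameMapper keeps the output sparse.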

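    # get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
    X_columns = mapper.features[0][0] + list(mapper.features[1][1].get_feature_names_out())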
    
    X_columns
    Out[76]: 
    ['feature_1',
     'feature_2',
     u'is',
     u'is more',
     u'is the',
     u'more',
     u'more random',
     u'more text',
     u'random',
     u'random text',
     u'text',
     u'the',
     u'the text',
     u'this',
     u'this is']
    