How to make pipeline for multiple dataframe columns?

Asked by 日久生厌 on 2020-12-21 08:29 · 3 answers · 1484 views

I have a DataFrame which can be simplified to this:

import pandas as pd

df = pd.DataFrame([{
    'title': 'batman',
    'text': 'man bat man bat',
    'url': ...


        
3 Answers
  • 2020-12-21 09:02

    Take a look at the following link: http://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html

    from sklearn.base import BaseEstimator, TransformerMixin

    class ItemSelector(BaseEstimator, TransformerMixin):
        def __init__(self, key):
            self.key = key

        def fit(self, x, y=None):
            return self

        def transform(self, data_dict):
            return data_dict[self.key]


    The key value accepts a pandas DataFrame column label. When using it in your pipeline it can be applied as:

    ('tfidf_word', Pipeline([
        ('selector', ItemSelector(key='column_name')),
        ('tfidf', TfidfVectorizer()),
    ]))
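Putting the pieces together, here is a minimal runnable sketch (the toy DataFrame, column names, and step names are stand-ins for the question's data, not from the original answer) that feeds two columns through `ItemSelector` into a `FeatureUnion`:

```python
# Sketch: ItemSelector picks one column out of the DataFrame, so each
# branch of the FeatureUnion can vectorize a different text column.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        # Indexing a DataFrame by column label returns that column's Series
        return data_dict[self.key]

# Toy stand-in data
df = pd.DataFrame({'title': ['batman', 'superman'],
                   'text': ['man bat man bat', 'super man']})

union = FeatureUnion([
    ('tfidf_title', Pipeline([
        ('selector', ItemSelector(key='title')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('tfidf_text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer()),
    ])),
])

X = union.fit_transform(df)
print(X.shape)  # rows x (title vocabulary + text vocabulary)
```

The combined matrix has one block of columns per branch, so the width is the sum of the two vocabularies.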
    
  • 2020-12-21 09:08

    @elphz's answer is a good intro to how you could use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.

    First off, I would say you need to define your FunctionTransformer functions such that they can handle and return your input data properly. In this case I assume you just want to pass the DataFrame, but ensure that you get back a properly shaped array for use downstream. Therefore I would propose passing the whole DataFrame and accessing columns by name, like so:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer

    def text(X):
        return X.text.values

    def title(X):
        return X.title.values

    pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])
    pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])
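As a quick sanity check (on a stand-in one-row DataFrame, not from the original answer), each selector pipeline returns a 1-D array of strings, which is exactly the input shape that TfidfVectorizer and CountVectorizer expect:

```python
# Sketch: verify the selector pipeline output before wiring up vectorizers.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def text(X):
    return X.text.values

pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])

df = pd.DataFrame({'title': ['batman'], 'text': ['man bat man bat']})
out = pipe_text.fit_transform(df)
print(out)  # a 1-D array containing 'man bat man bat'
```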
    

    Now, to test variations of transformers and classifiers, I would propose using a list of transformers and a list of classifiers and simply iterating through them, much like a grid search.

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.linear_model import LogisticRegression, RidgeClassifier
    from sklearn.model_selection import train_test_split

    tfidf = TfidfVectorizer()
    cv = CountVectorizer()
    lr = LogisticRegression()
    rc = RidgeClassifier()

    transformers = [('tfidf', tfidf), ('cv', cv)]
    clfs = [lr, rc]

    best_clf = None
    best_score = 0
    for tran1 in transformers:
        for tran2 in transformers:
            pipe1 = Pipeline(pipe_text.steps + [tran1])
            pipe2 = Pipeline(pipe_title.steps + [tran2])
            union = FeatureUnion([('text', pipe1), ('title', pipe2)])
            X = union.fit_transform(df)
            X_train, X_test, y_train, y_test = train_test_split(X, df.label)
            for clf in clfs:
                clf.fit(X_train, y_train)
                score = clf.score(X_test, y_test)
                if score > best_score:
                    best_score = score
                    best_clf = clf
    

    This is a simple example, but you can see how you could plug in any variety of transformations and classifiers in this way.
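The manual loop above can also be expressed with GridSearchCV (a sketch on made-up toy data, not part of the original answer): nest the FeatureUnion and a classifier in one Pipeline, then let the grid swap vectorizers and estimators by step name.

```python
# Sketch: GridSearchCV searches vectorizer/classifier combinations by
# addressing pipeline steps with the step__substep__name convention.
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV

def text(X):
    return X.text.values

def title(X):
    return X.title.values

# Toy stand-in for the question's DataFrame
df = pd.DataFrame({'title': ['batman', 'superman', 'batman returns', 'superman ii'],
                   'text': ['man bat man bat', 'super man', 'bat man bat', 'man super man'],
                   'label': [0, 1, 0, 1]})

pipe = Pipeline([
    ('union', FeatureUnion([
        ('text', Pipeline([('col', FunctionTransformer(text, validate=False)),
                           ('vec', TfidfVectorizer())])),
        ('title', Pipeline([('col', FunctionTransformer(title, validate=False)),
                            ('vec', TfidfVectorizer())])),
    ])),
    ('clf', LogisticRegression()),
])

# 2 vectorizers per column x 2 classifiers = 8 candidate combinations
param_grid = {
    'union__text__vec': [TfidfVectorizer(), CountVectorizer()],
    'union__title__vec': [TfidfVectorizer(), CountVectorizer()],
    'clf': [LogisticRegression(), RidgeClassifier()],
}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(df, df.label)
print(grid.best_params_)
```

This also keeps the vectorizers inside the cross-validation loop, so they are fit only on each training fold rather than on the full dataset up front.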

  • 2020-12-21 09:10

    I would use a combination of FunctionTransformer to select only certain columns, and then FeatureUnion to combine TFIDF, word count, etc features on each column. There may be a slightly cleaner way, but I think you'll end up with some sort of FeatureUnion and Pipeline nesting regardless.

    from sklearn.preprocessing import FunctionTransformer
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    
    def first_column(X):
        return X.iloc[:, 0]
    
    def second_column(X):
        return X.iloc[:, 1]
    
    # pipeline to get all tfidf and word count for first column
    pipeline_one = Pipeline([
        ('column_selection', FunctionTransformer(first_column, validate=False)),
        ('feature-extractors', FeatureUnion([
            ('tfidf', TfidfVectorizer()),
            ('counts', CountVectorizer()),
        ])),
    ])

    # Then a second pipeline to do the same for the second column
    pipeline_two = Pipeline([
        ('column_selection', FunctionTransformer(second_column, validate=False)),
        ('feature-extractors', FeatureUnion([
            ('tfidf', TfidfVectorizer()),
            ('counts', CountVectorizer()),
        ])),
    ])
    
    
    # Then you would again feature union these pipelines 
    # to get different feature selection for each column
    final_transformer = FeatureUnion([('first-column-features', pipeline_one),
                                      ('second-column-feature', pipeline_two)])
    
    # Your dataframe has your target as one of its columns, so make sure to drop it first
    y = df['label']
    df = df.drop('label', axis=1)
    
    # Now fit transform should work
    final_transformer.fit_transform(df)
    

    If you don't want to apply multiple transformers to each column (tfidf and counts together likely won't both be useful), then you can cut down on the nesting by one step.
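For instance, a flattened variant with a single vectorizer per column (a sketch on stand-in data, not from the original answer) drops the inner FeatureUnion entirely:

```python
# Sketch: one vectorizer per column, so each FeatureUnion branch is just
# selector -> vectorizer, with no nested FeatureUnion.
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

def first_column(X):
    return X.iloc[:, 0]

def second_column(X):
    return X.iloc[:, 1]

final_transformer = FeatureUnion([
    ('first-column-tfidf', Pipeline([
        ('column_selection', FunctionTransformer(first_column, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('second-column-tfidf', Pipeline([
        ('column_selection', FunctionTransformer(second_column, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ])),
])

# Toy stand-in data
df = pd.DataFrame({'title': ['batman', 'superman'],
                   'text': ['man bat man bat', 'super man']})
X = final_transformer.fit_transform(df)
print(X.shape)  # rows x (first-column vocabulary + second-column vocabulary)
```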
