Question
As part of a research project, I want to select the best combination of preprocessing techniques and textual features that optimize the results of a text classification task. For this, I am using Python 3.6.
There are a number of methods to combine features and algorithms, but I want to take full advantage of sklearn's pipelines and test all the different (valid) possibilities using grid search for the ultimate feature combo.
My first step was to build a pipeline that looks like the following:
# Run a vectorizer with a predefined tweet tokenizer and a Naive Bayes
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tweet_tokenizer)),
    ('nb', MultinomialNB())
])

parameters = {
    'vectorizer__preprocessor': (None, preprocessor)
}

gs = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
In this simple example, the vectorizer tokenizes the data using tweet_tokenizer and the grid search then tests which preprocessing option (None or the predefined function) yields better results.
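Running the search would then look roughly like this; X_train and y_train are placeholder names for the tweets and their labels:

gs.fit(X_train, y_train)      # fits 2 candidates x 5 folds
print(gs.best_params_)        # e.g. {'vectorizer__preprocessor': <function preprocessor ...>}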
This seems like a decent start, but I am now struggling to find a way to test all the different possibilities within the preprocessor function, defined below:
def preprocessor(tweet):
    # Data cleaning
    tweet = URL_remover(tweet)            # Removing URLs
    tweet = mentions_remover(tweet)       # Removing mentions
    tweet = email_remover(tweet)          # Removing emails
    tweet = irrelev_chars_remover(tweet)  # Removing invalid chars
    tweet = emojies_converter(tweet)      # Translating emojis
    tweet = to_lowercase(tweet)           # Converting words to lowercase
    # Others
    tweet = hashtag_decomposer(tweet)     # Hashtag decomposition
    # Punctuation may only be removed after hashtag decomposition
    # because it considers "#" as punctuation
    tweet = punct_remover(tweet)          # Punctuation
    return tweet
A "simple" solution to combine all the different processing techniques would be to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.) and set the grid parameter as follows:
parameters = {
    'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...)
}
Although this would most likely work, it isn't a viable or reasonable solution for this task, especially since there are 2^n different combinations of n preprocessing steps and, consequently, just as many functions to write.
The ultimate goal is to combine both preprocessing techniques and features in a pipeline in order to optimize the classification results using grid search:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tweet_tokenizer)),
    ('feat_extractor', feat_extractor),
    ('nb', MultinomialNB())
])

parameters = {
    'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...),
    'feat_extractor': (None, func_A, func_B, func_C, ...)
}
Is there a simpler way to achieve this?
Answer 1:
This answer is a rough sketch based on your description, and the specifics will depend on the type of data used. Before building the pipeline, let's understand how CountVectorizer works on the raw_documents that are passed to it. Essentially, this is the line that processes the string documents into tokens,

return lambda doc: self._word_ngrams(tokenize(preprocess(self.decode(doc))), stop_words)

which are then just counted and converted into the count matrix.
So what happens here is:

1. decode: Just decides how to read the data from a file (if specified). Not of use to us, since we already have the data in a list.

2. preprocess: Does the following if 'strip_accents' and 'lowercase' are True in CountVectorizer, else nothing:

   strip_accents(x.lower())

   Again, of no use, because we are moving the lowercase functionality into our own preprocessor and don't need to strip accents, since we already have the data as a list of strings.

3. tokenize: Removes all punctuation, retains only alphanumeric words of length 2 or more, and returns a list of tokens for a single document (element of the list):

   lambda doc: token_pattern.findall(doc)

   This should be kept in mind. If you want to handle punctuation and other symbols yourself (deciding to keep some and remove others), then you should also change the default token_pattern=r'(?u)\b\w\w+\b' of CountVectorizer.

4. _word_ngrams: This method first removes the stop words (supplied as a parameter above) from the list of tokens from the previous step and then calculates the n-grams as defined by the ngram_range param in CountVectorizer. This should also be kept in mind if you want to handle the n-grams your own way.

Note: If the analyzer is set to 'char', then the tokenizer step will not be performed and n-grams will be made from characters.
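To see this behaviour in practice, here is a minimal sketch of the default analyzer applied to a made-up tweet (only an illustration, not part of the final pipeline):

from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer().build_analyzer()
print(analyzer("Loving the new #Python3 release :) a lot!!"))
# ['loving', 'the', 'new', 'python3', 'release', 'lot']
# Punctuation, the emoticon and the 1-character token "a" are gone, and
# "#Python3" lost its "#" - exactly the default behaviour described above.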
So now coming to our pipeline. This is the structure I think can work here:
X --> combined_pipeline, Pipeline
            |
            |  Raw data is passed to Preprocessor
            |
            \/
      Preprocessor
            |
            |  Cleaned data (still raw texts) is passed to FeatureUnion
            |
            \/
      FeatureUnion
            |
            |  Data is duplicated and passed to both parts
     _______|________________________
     |                              |
     |                              |
     \/                             \/
CountVectorizer              FeatureExtractor
     |                              |
     |  Converts raw to             |  Extracts numerical features
     |  count-matrix                |  from raw data
     \/_____________________________\/
                   |
                   |  FeatureUnion combines both the matrices
                   |
                   \/
              Classifier
Now coming to the code. This is what the pipeline looks like:
# Imports
from sklearn.svm import SVC
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer

# Pipeline
pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([("vectorizer", CountVectorizer()),
                                            ("extractor", CustomFeatureExtractor())
                                            ])),
                 ('classifier', SVC())
                ])
Where CustomPreprocessor and CustomFeatureExtractor are defined as:
from sklearn.base import TransformerMixin, BaseEstimator

class CustomPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, remove_urls=True, remove_mentions=True,
                 remove_emails=True, remove_invalid_chars=True,
                 convert_emojis=True, lowercase=True,
                 decompose_hashtags=True, remove_punctuations=True):
        self.remove_urls = remove_urls
        self.remove_mentions = remove_mentions
        self.remove_emails = remove_emails
        self.remove_invalid_chars = remove_invalid_chars
        self.convert_emojis = convert_emojis
        self.lowercase = lowercase
        self.decompose_hashtags = decompose_hashtags
        self.remove_punctuations = remove_punctuations

    # You need to have all the helper functions ready.
    # This method works on single tweets.
    def preprocessor(self, tweet):
        # Data cleaning
        if self.remove_urls:
            tweet = URL_remover(tweet)            # Removing URLs
        if self.remove_mentions:
            tweet = mentions_remover(tweet)       # Removing mentions
        if self.remove_emails:
            tweet = email_remover(tweet)          # Removing emails
        if self.remove_invalid_chars:
            tweet = irrelev_chars_remover(tweet)  # Removing invalid chars
        if self.convert_emojis:
            tweet = emojies_converter(tweet)      # Translating emojis
        if self.lowercase:
            tweet = to_lowercase(tweet)           # Converting words to lowercase
        if self.decompose_hashtags:
            # Others
            tweet = hashtag_decomposer(tweet)     # Hashtag decomposition
        # Punctuation may only be removed after hashtag decomposition
        # because it considers "#" as punctuation
        if self.remove_punctuations:
            tweet = punct_remover(tweet)          # Punctuation
        return tweet

    def fit(self, raw_docs, y=None):
        # No-op - we don't learn anything from the data
        return self

    def transform(self, raw_docs):
        return [self.preprocessor(tweet) for tweet in raw_docs]
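As a side note, storing every __init__ argument under an attribute of the same name is what lets BaseEstimator expose them to GridSearchCV. A quick sanity check (this only instantiates the class, so the helper functions need not exist yet):

prep = CustomPreprocessor()
print(sorted(prep.get_params()))     # all eight flags show up as tunable parameters
prep.set_params(remove_urls=False)   # this is what GridSearchCV does internally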
from textblob import TextBlob
import numpy as np

# Same thing for feature extraction
class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, sentiment_analysis=True, tweet_length=True):
        self.sentiment_analysis = sentiment_analysis
        self.tweet_length = tweet_length

    # This method works on single tweets
    def extractor(self, tweet):
        features = []
        if self.sentiment_analysis:
            blob = TextBlob(tweet)
            features.append(blob.sentiment.polarity)
        if self.tweet_length:
            features.append(len(tweet))
        # Do the same for any other features you want.
        return np.array(features)

    def fit(self, raw_docs, y=None):
        # No-op - again I am assuming that we don't learn anything from the data:
        # definitely not for tweet length, and also not for sentiment analysis
        # or any other thing you might have here.
        return self

    def transform(self, raw_docs):
        # Return a numpy array so that the FeatureUnion can handle it correctly
        return np.vstack([self.extractor(tweet) for tweet in raw_docs])
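For illustration, the extractor produces one row per tweet with one column per enabled feature. A small sketch, assuming textblob is installed and using made-up tweets:

ext = CustomFeatureExtractor()                     # both features enabled
X_extra = ext.transform(["I love this!", "meh"])   # transform needs no fitting here
print(X_extra.shape)                               # (2, 2): polarity and length per tweet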
Finally, the parameter grid can now be set up easily, like:

param_grid = {'preprocessor__remove_urls': [True, False],
              'preprocessor__remove_mentions': [True, False],
              ...
              ...
              # No need to search for lowercase or preprocessor in CountVectorizer
              'features__vectorizer__max_df': [0.1, 0.2, 0.3],
              ...
              ...
              'features__extractor__sentiment_analysis': [True, False],
              'features__extractor__tweet_length': [True, False],
              ...
              ...
              'classifier__C': [0.01, 0.1, 1.0]
             }
The above code avoids the need "to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.)". Just use True and False, and GridSearchCV will handle the combinations.
Update:
If you don't want to have the CountVectorizer, you can remove it from the pipeline and the parameter grid, and the new pipeline will be:
pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ("extractor", CustomFeatureExtractor()),
                 ('classifier', SVC())
                ])
Then make sure to implement all the functionality you want in CustomFeatureExtractor. If that becomes too complex, you can always write simpler extractors and combine them together in a FeatureUnion in place of CountVectorizer, as in the sketch below.
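For instance, something along these lines, where SentimentExtractor and LengthExtractor are hypothetical smaller transformers written in the same style as CustomFeatureExtractor above:

pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([('sentiment', SentimentExtractor()),   # hypothetical
                                            ('length', LengthExtractor())])),      # hypothetical
                 ('classifier', SVC())
                ])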
Source: https://stackoverflow.com/questions/53841913/perform-feature-selection-using-pipeline-and-gridsearch