Question
As part of a research project, I want to select the best combination of preprocessing techniques and textual features that optimize the results of a text classification task. For this, I am using Python 3.6.
There are a number of methods to combine features and algorithms, but I want to take full advantage of sklearn's pipelines and test all the different (valid) possibilities using grid search for the ultimate feature combo.
My first step was to build a pipeline that looks like the following:
# Run a vectorizer with a predefined tweet tokenizer and a Naive Bayes
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tweet_tokenizer)),
    ('nb', MultinomialNB())
])

parameters = {
    'vectorizer__preprocessor': (None, preprocessor)
}

gs = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
In this simple example, the vectorizer tokenizes the data using tweet_tokenizer and the grid search then tests which preprocessing option (None or the predefined function) yields better results.
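Running the search would then look roughly like this; X_train and y_train are placeholder names for the tweets and their labels:

gs.fit(X_train, y_train)      # fits 2 candidates x 5 folds
print(gs.best_params_)        # e.g. {'vectorizer__preprocessor': <function preprocessor ...>}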
This seems like a decent start, but I am now struggling to find a way to test all the different possibilities within the preprocessor function, defined below:
def preprocessor(tweet):
    # Data cleaning
    tweet = URL_remover(tweet)            # Removing URLs
    tweet = mentions_remover(tweet)       # Removing mentions
    tweet = email_remover(tweet)          # Removing emails
    tweet = irrelev_chars_remover(tweet)  # Removing invalid chars
    tweet = emojies_converter(tweet)      # Translating emojis
    tweet = to_lowercase(tweet)           # Converting words to lowercase
    # Others
    tweet = hashtag_decomposer(tweet)     # Hashtag decomposition
    # Punctuation may only be removed after hashtag decomposition
    # because it considers "#" as punctuation
    tweet = punct_remover(tweet)          # Punctuation
    return tweet
A "simple" solution to combine all the different processing techniques would be to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.) and set the grid parameter as follows:
parameters = {
    'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...)
}
Although this would most likely work, it isn't a viable or reasonable solution for this task, especially since there are 2^n different combinations of n preprocessing steps and, consequently, just as many functions to write.
The ultimate goal is to combine both preprocessing techniques and features in a pipeline in order to optimize the classification results using grid search:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tweet_tokenizer)),
    ('feat_extractor', feat_extractor),
    ('nb', MultinomialNB())
])

parameters = {
    'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...),
    'feat_extractor': (None, func_A, func_B, func_C, ...)
}
Is there a simpler way to achieve this?
Answer 1:
This answer is a rough sketch based on your description, and the specifics will depend on the type of data used. Before building the pipeline, let's understand how CountVectorizer works on the raw_documents that are passed to it. Essentially, this is the line that processes the string documents into tokens,

return lambda doc: self._word_ngrams(tokenize(preprocess(self.decode(doc))), stop_words)

which are then just counted and converted into the count matrix.
So what happens here is:

1. decode: Just decides how to read the data from a file (if specified). Not of use to us, since we already have the data in a list.

2. preprocess: Does the following if 'strip_accents' and 'lowercase' are True in CountVectorizer, else nothing:

   strip_accents(x.lower())

   Again, of no use, because we are moving the lowercase functionality into our own preprocessor and don't need to strip accents, since we already have the data as a list of strings.

3. tokenize: Removes all punctuation, retains only alphanumeric words of length 2 or more, and returns a list of tokens for a single document (element of the list):

   lambda doc: token_pattern.findall(doc)

   This should be kept in mind. If you want to handle punctuation and other symbols yourself (deciding to keep some and remove others), then you should also change the default token_pattern=r'(?u)\b\w\w+\b' of CountVectorizer.

4. _word_ngrams: This method first removes the stop words (supplied as a parameter above) from the list of tokens from the previous step and then calculates the n-grams as defined by the ngram_range param in CountVectorizer. This should also be kept in mind if you want to handle the n-grams your own way.

Note: If the analyzer is set to 'char', then the tokenizer step will not be performed and n-grams will be made from characters.
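To see this behaviour in practice, here is a minimal sketch of the default analyzer applied to a made-up tweet (only an illustration, not part of the final pipeline):

from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer().build_analyzer()
print(analyzer("Loving the new #Python3 release :) a lot!!"))
# ['loving', 'the', 'new', 'python3', 'release', 'lot']
# Punctuation, the emoticon and the 1-character token "a" are gone, and
# "#Python3" lost its "#" - exactly the default behaviour described above.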
So now coming to our pipeline. This is the structure I think can work here:
X --> combined_pipeline, Pipeline
            |
            |  Raw data is passed to Preprocessor
            |
            \/
      Preprocessor
            |
            |  Cleaned data (still raw texts) is passed to FeatureUnion
            |
            \/
      FeatureUnion
            |
            |  Data is duplicated and passed to both parts
     _______|________________________
     |                              |
     |                              |
     \/                             \/
CountVectorizer              FeatureExtractor
     |                              |
     |  Converts raw to             |  Extracts numerical features
     |  count-matrix                |  from raw data
     \/_____________________________\/
                   |
                   |  FeatureUnion combines both the matrices
                   |
                   \/
              Classifier
Now coming to the code. This is what the pipeline looks like:
# Imports
from sklearn.svm import SVC
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer

# Pipeline
pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([("vectorizer", CountVectorizer()),
                                            ("extractor", CustomFeatureExtractor())
                                            ])),
                 ('classifier', SVC())
                ])
Where CustomPreprocessor and CustomFeatureExtractor are defined as:
from sklearn.base import TransformerMixin, BaseEstimator

class CustomPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, remove_urls=True, remove_mentions=True,
                 remove_emails=True, remove_invalid_chars=True,
                 convert_emojis=True, lowercase=True,
                 decompose_hashtags=True, remove_punctuations=True):
        self.remove_urls = remove_urls
        self.remove_mentions = remove_mentions
        self.remove_emails = remove_emails
        self.remove_invalid_chars = remove_invalid_chars
        self.convert_emojis = convert_emojis
        self.lowercase = lowercase
        self.decompose_hashtags = decompose_hashtags
        self.remove_punctuations = remove_punctuations

    # You need to have all the helper functions ready.
    # This method works on single tweets.
    def preprocessor(self, tweet):
        # Data cleaning
        if self.remove_urls:
            tweet = URL_remover(tweet)            # Removing URLs
        if self.remove_mentions:
            tweet = mentions_remover(tweet)       # Removing mentions
        if self.remove_emails:
            tweet = email_remover(tweet)          # Removing emails
        if self.remove_invalid_chars:
            tweet = irrelev_chars_remover(tweet)  # Removing invalid chars
        if self.convert_emojis:
            tweet = emojies_converter(tweet)      # Translating emojis
        if self.lowercase:
            tweet = to_lowercase(tweet)           # Converting words to lowercase
        if self.decompose_hashtags:
            # Others
            tweet = hashtag_decomposer(tweet)     # Hashtag decomposition
        # Punctuation may only be removed after hashtag decomposition
        # because it considers "#" as punctuation
        if self.remove_punctuations:
            tweet = punct_remover(tweet)          # Punctuation
        return tweet

    def fit(self, raw_docs, y=None):
        # No-op - we don't learn anything from the data
        return self

    def transform(self, raw_docs):
        return [self.preprocessor(tweet) for tweet in raw_docs]
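As a side note, storing every __init__ argument under an attribute of the same name is what lets BaseEstimator expose them to GridSearchCV. A quick sanity check (this only instantiates the class, so the helper functions need not exist yet):

prep = CustomPreprocessor()
print(sorted(prep.get_params()))     # all eight flags show up as tunable parameters
prep.set_params(remove_urls=False)   # this is what GridSearchCV does internally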
from textblob import TextBlob
import numpy as np

# Same thing for feature extraction
class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, sentiment_analysis=True, tweet_length=True):
        self.sentiment_analysis = sentiment_analysis
        self.tweet_length = tweet_length

    # This method works on single tweets
    def extractor(self, tweet):
        features = []
        if self.sentiment_analysis:
            blob = TextBlob(tweet)
            features.append(blob.sentiment.polarity)
        if self.tweet_length:
            features.append(len(tweet))
        # Do the same for any other features you want.
        return np.array(features)

    def fit(self, raw_docs, y=None):
        # No-op - again I am assuming that we don't learn anything from the data:
        # definitely not for tweet length, and also not for sentiment analysis
        # or any other thing you might have here.
        return self

    def transform(self, raw_docs):
        # Return a numpy array so that the FeatureUnion can handle it correctly
        return np.vstack([self.extractor(tweet) for tweet in raw_docs])
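For illustration, the extractor produces one row per tweet with one column per enabled feature. A small sketch, assuming textblob is installed and using made-up tweets:

ext = CustomFeatureExtractor()                     # both features enabled
X_extra = ext.transform(["I love this!", "meh"])   # transform needs no fitting here
print(X_extra.shape)                               # (2, 2): polarity and length per tweet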
Finally, the parameter grid can now be set up easily, like:

param_grid = {'preprocessor__remove_urls': [True, False],
              'preprocessor__remove_mentions': [True, False],
              ...
              ...
              # No need to search for lowercase or preprocessor in CountVectorizer
              'features__vectorizer__max_df': [0.1, 0.2, 0.3],
              ...
              ...
              'features__extractor__sentiment_analysis': [True, False],
              'features__extractor__tweet_length': [True, False],
              ...
              ...
              'classifier__C': [0.01, 0.1, 1.0]
             }
The above code avoids the need "to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.)". Just use True and False, and GridSearchCV will handle the combinations.
Update:
If you don't want to have the CountVectorizer, you can remove it from the pipeline and the parameter grid, and the new pipeline will be:
pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ("extractor", CustomFeatureExtractor()),
                 ('classifier', SVC())
                ])
Then make sure to implement all the functionality you want in CustomFeatureExtractor. If that becomes too complex, you can always write simpler extractors and combine them together in a FeatureUnion in place of CountVectorizer, as in the sketch below.
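For instance, something along these lines, where SentimentExtractor and LengthExtractor are hypothetical smaller transformers written in the same style as CustomFeatureExtractor above:

pipe = Pipeline([('preprocessor', CustomPreprocessor()),
                 ('features', FeatureUnion([('sentiment', SentimentExtractor()),   # hypothetical
                                            ('length', LengthExtractor())])),      # hypothetical
                 ('classifier', SVC())
                ])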
Source: https://stackoverflow.com/questions/53841913/perform-feature-selection-using-pipeline-and-gridsearch