How to remove stop phrases/stop ngrams (multi-word strings) using pandas/sklearn?


Question


I want to prevent certain phrases from creeping into my models. For example, I want to prevent 'red roses' from entering into my analysis. I understand how to add individual stop words, as described in Adding words to scikit-learn's CountVectorizer's stop list, like this:

from sklearn.feature_extraction import text
additional_stop_words = ['red', 'roses']

However, this also results in other ngrams like 'red tulips' or 'blue roses' not being detected.

I am building a TfidfVectorizer as part of my model, and I realize the processing I need might have to happen after this stage, but I am not sure how to do it.

My eventual aim is to do topic modelling on a piece of text. Here is the piece of code (borrowed almost directly from https://de.dariah.eu/tatom/topic_model_python.html#index-0 ) that I am working on:

import numpy as np

from sklearn import decomposition
from sklearn.feature_extraction import text

additional_stop_words = ['red', 'roses']

sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=5
)

dtm = mod_vectorizer.fit_transform(df[col]).toarray()
# note: in scikit-learn >= 1.0 this method is get_feature_names_out()
vocab = np.array(mod_vectorizer.get_feature_names())
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
    n_components=num_topics,  # named n_topics in older scikit-learn versions
    random_state=1
)

doctopic = m_clf.fit_transform(dtm)
topic_words = []

for topic in m_clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))

EDIT

Sample dataframe (I have tried to insert as many edge cases as possible), df:

   Content
0  I like red roses as much as I like blue tulips.
1  It would be quite unusual to see red tulips, but not RED ROSES
2  It is almost impossible to find blue roses
3  I like most red flowers, but roses are my favorite.
4  Could you buy me some red roses?
5  John loves the color red. Roses are Mary's favorite flowers.

Answer 1:


TfidfVectorizer allows for a custom preprocessor. You can use this to make any needed adjustments.

For example, to remove all occurrences of consecutive "red" + "roses" tokens from your example corpus (case-insensitive), use:

import re

import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex considers "... red. Roses..." as fair game for removal.
    #       if that's not what you want, just use ["red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # define our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())

Now vocab has all red roses references removed.

print(sorted(vocab))

['Could buy',
 'It impossible',
 'It impossible blue',
 'It quite',
 'It quite unusual',
 'John loves',
 'John loves color',
 'Mary favorite',
 'Mary favorite flowers',
 'blue roses',
 'blue tulips',
 'color Mary',
 'color Mary favorite',
 'favorite flowers',
 'flowers roses',
 'flowers roses favorite',
 'impossible blue',
 'impossible blue roses',
 'like blue',
 'like blue tulips',
 'like like',
 'like like blue',
 'like red',
 'like red flowers',
 'loves color',
 'loves color Mary',
 'quite unusual',
 'quite unusual red',
 'red flowers',
 'red flowers roses',
 'red tulips',
 'roses favorite',
 'unusual red',
 'unusual red tulips']

UPDATE (per comment thread):

To pass in desired stop phrases along with custom stop words to a wrapper function, use:

desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2,3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names())

    return vocab

wrapper(desired_stop_words, desired_stop_phrases)



Answer 2:


You can switch out the tokenizer of the TfidfVectorizer by passing it the keyword argument tokenizer (doc-src).

The original build_tokenizer looks like this:

def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)

So let's make a function that removes all the word combinations you don't want. First let's define the expressions you don't want:

unwanted_expressions = [('red','roses'), ('foo', 'bar')]

and the function would need to look something like this:

token_pattern_str = r"(?u)\b\w\w+\b"

def my_tokenizer(doc):
    """Split a string into a sequence of tokens
    and remove some word combinations along the way."""
    token_pattern = re.compile(token_pattern_str)
    tokens = token_pattern.findall(doc)
    for i in range(len(tokens)):
        for expr in unwanted_expressions:
            if i + len(expr) > len(tokens):
                # expression would run past the end of the token list
                continue
            found = all(tokens[i + j] == word for j, word in enumerate(expr))
            if found:
                # blank out the matched expression; the tokens are dropped below
                tokens[i:i + len(expr)] = len(expr) * [None]
    tokens = [x for x in tokens if x is not None]
    return tokens

I have not tried this out specifically myself, but I have switched out the tokenizer before and it works well.
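
For completeness, wiring the custom tokenizer into the vectorizer might look something like the sketch below. This is an assumption on my part, not shown in the original answer; it reuses the cases list from Answer 1, and token_pattern=None just silences the warning that the pattern is ignored once a tokenizer is supplied:

# Sketch: plug the phrase-removing tokenizer into TfidfVectorizer.
from sklearn.feature_extraction import text

mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    tokenizer=my_tokenizer,  # custom tokenizer defined above
    token_pattern=None,      # unused once a tokenizer is supplied
    norm='l2',
    min_df=1
)

dtm = mod_vectorizer.fit_transform(cases).toarray()

Note that TfidfVectorizer lowercases the text before tokenizing by default, so the tuples in unwanted_expressions should be given in lowercase.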

Good luck :)




Answer 3:


Before passing df to mod_vectorizer, you should use something like the following:

df=["I like red roses as much as I like blue tulips.",
"It would be quite unusual to see red tulips, but not RED ROSES",
"It is almost impossible to find blue roses",
"I like most red flowers, but roses are my favorite.",
"Could you buy me some red roses?",
"John loves the color red. Roses are Mary's favorite flowers."]

df=[ i.lower() for i in df]
df=[i if 'red roses' not in i else i.replace('red roses','') for i in df]

If you are checking for more than just "red roses", then replace the last line above with:

stop_phrases = ['red roses']

def filterPhrase(data, stop_phrases):
    for i in range(len(data)):
        for x in stop_phrases:
            if x in data[i]:
                data[i] = data[i].replace(x, '')
    return data

df = filterPhrase(df, stop_phrases)
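
If the documents live in a pandas DataFrame column, as in the question's df, the same idea can be written with a vectorized string replacement. This is just a sketch under that assumption, not part of the original answer:

import re
import pandas as pd

df = pd.DataFrame({'Content': ["I like red roses as much as I like blue tulips.",
                               "Could you buy me some red roses?"]})

stop_phrases = [r'red\s+roses']
for phrase in stop_phrases:
    # remove each stop phrase from the whole column, case-insensitively
    df['Content'] = df['Content'].str.replace(phrase, '', flags=re.IGNORECASE, regex=True)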



Answer 4:


For pandas, you want to use a list comprehension:

.apply(lambda x: [item for item in x if item not in stop])
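
As a sketch of how that might be used (an assumption, not spelled out in the answer): it presumes the column has already been split into lists of tokens and that stop is a set of unwanted words, so on its own it removes single words rather than multi-word phrases:

import pandas as pd

df = pd.DataFrame({'Content': ["I like red roses", "Could you buy me some red roses?"]})
stop = {'red', 'roses'}

# tokenize each row, then drop any token that appears in the stop set
tokens = df['Content'].str.lower().str.split()
df['filtered'] = tokens.apply(lambda x: [item for item in x if item not in stop])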


Source: https://stackoverflow.com/questions/45426215/how-to-remove-stop-phrases-stop-ngrams-multi-word-strings-using-pandas-sklearn
