Question
I am cleaning text and then passing it to CountVectorizer to count how many times each word appears in the text. The problem is that it treats 10,000x as two words (10 and 000x). Similarly, it treats 5.00 as two different words (5 and 00).
I have tried the following:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Adjacent string literals are concatenated, so the text stays one document
corpus = [
    "userna lightning strike megawaysnew release there's many "
    "ways win lightning strike megaways. start epic adventure today, seek "
    "mystery symbols, re-spins wild multipliers, mega spins gamble lead wins "
    "10,000x bet!"
]
vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2
cols = vectorizer.get_feature_names_out()
res_df45 = pd.DataFrame(result, columns=cols)
In the data frame, both "10" and "000x" are given a count of 1 but I need them to be treated as one word (10,000x). How can I do this?
Answer 1:
The default regex pattern the tokenizer uses (the token_pattern parameter) is:
token_pattern='(?u)\\b\\w\\w+\\b'
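So a word is defined by a \b word boundary at the beginning and the end, with \w\w+ (one alphanumeric character followed by one or more alphanumeric characters) between the boundaries. Because , and . are not word characters, they act as boundaries, which is why 10,000x is split into 10 and 000x. In a regular (non-raw) Python string, each backslash of the regex has to be escaped as \\.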
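To see why the default pattern splits these numbers, the tokenization step can be reproduced with Python's re module alone. This is only a minimal sketch: it assumes that the default analyzer lowercases the text and then collects every match of token_pattern, and the sample sentence is made up for illustration.

```python
import re

# Default token_pattern of CountVectorizer, written as a raw string
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical sample text, lowercased as the default analyzer would do
text = "Wins 10,000x bet, costs 5.00!".lower()

# Every non-overlapping match of the pattern becomes one token
tokens = re.findall(DEFAULT_PATTERN, text)
print(tokens)
```

Note that the lone 5 from 5.00 does not show up as a token at all, because the pattern requires at least two word characters.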
So you could change the token pattern to:
token_pattern='\\b(\\w+[\\.,]?\\w+)\\b'
Explanation: [\\.,]? allows for the optional appearance of a . or a , inside the token. The regex for the first alphanumeric character, \w, has to be extended to \w+ to match numbers with more than one digit before the punctuation.
For your slightly adjusted example:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
# Custom pattern: word characters, optionally joined by a single . or ,
vectorizer = CountVectorizer(token_pattern='\\b(\\w+[\\.,]?\\w+)\\b')
result = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2
cols = vectorizer.get_feature_names_out()
print(pd.DataFrame(result, columns=cols))
Output:
   10,000x  2.5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0        1    1   1    1   1          1     1   1   1        1      1       1      1       1
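One caveat with this pattern is that it joins any two runs of word characters separated by a single . or , and it allows only one such separator per token. The sketch below compares it with a stricter variant that joins only digit groups. NUMERIC_PATTERN is a hypothetical pattern written for illustration, not part of the original answer, and the sample text is made up. The comparison uses re.findall directly, on the assumption that CountVectorizer tokenizes by collecting all matches of token_pattern.

```python
import re

# Pattern from the answer: joins any two word runs around one "." or ","
ANSWER_PATTERN = r"\b(\w+[\.,]?\w+)\b"

# Hypothetical stricter variant: only digit groups may be joined by "." or ",",
# everything else falls back to the default two-or-more-characters rule
NUMERIC_PATTERN = r"(?u)\b\d+(?:[.,]\d+)*\w*\b|\b\w\w+\b"

# Made-up sample with a missing space after a period and a multi-comma number
text = "megaways.start wins 10,000x or 1,000,000 at 2.5"

answer_tokens = re.findall(ANSWER_PATTERN, text)
numeric_tokens = re.findall(NUMERIC_PATTERN, text)
print(answer_tokens)
print(numeric_tokens)
```

With the answer's pattern, megaways.start stays glued together and 1,000,000 is cut into 1,000 and 000; the stricter variant keeps the words apart and the number whole. One difference from the default behaviour is that the numeric branch also accepts a single digit as a token. If it suits your data, the variant could be passed as token_pattern=NUMERIC_PATTERN.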
Alternatively, you could modify your input text, e.g. by replacing the decimal point . with an underscore _ and removing commas that stand between digits.
import re

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
for i in range(len(corpus)):
    # Replace a decimal point between digits with an underscore: 2.5 -> 2_5
    corpus[i] = re.sub(r"(\d+)\.(\d+)", r"\1_\2", corpus[i])
    # Drop a comma between digits: 10,000x -> 10000x
    corpus[i] = re.sub(r"(\d+),(\d+)", r"\1\2", corpus[i])
vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2
cols = vectorizer.get_feature_names_out()
print(pd.DataFrame(result, columns=cols))
Output:
   10000x  2_5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0       1    1   1    1   1          1     1   1   1        1      1       1      1       1
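The two substitutions consume the digits on both sides of each separator, so a number with several comma groups such as 1,000,000 would come out as 1000,000. A possible refinement, sketched here with a hypothetical helper named normalize_numbers and a made-up corpus, uses lookarounds so that only the separator itself is matched:

```python
import re

def normalize_numbers(text: str) -> str:
    """Rewrite separators that sit directly between two digits (hypothetical helper)."""
    # A decimal point between digits becomes an underscore: 2.5 -> 2_5
    text = re.sub(r"(?<=\d)\.(?=\d)", "_", text)
    # A comma between digits is dropped: 1,000,000 -> 1000000
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    return text

# Made-up corpus for illustration
corpus = ["I am userna lightning strike 2.5 release many 1,000,000x bet in NA!"]
cleaned = [normalize_numbers(doc) for doc in corpus]
print(cleaned)
```

The cleaned list can then be handed to fit_transform as before. A function like this could also be supplied through CountVectorizer's preprocessor parameter, but that replaces the built-in lowercasing step, so the helper would then need to lowercase the text itself.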
Source: https://stackoverflow.com/questions/57325870/how-to-treat-number-with-decimals-or-with-commas-as-one-word-in-countvectorizer