Question
I have to perform stemming on a text. The tasks are as follows:
- Tokenize all the words given in tc. A word contains alphabets, numbers, or underscores. Store the tokenized list of words in tw.
- Convert all the words into lowercase. Store the result in the variable tw.
- Remove all the stop words from the unique set of tw. Store the result in the variable fw.
- Stem each word present in fw with PorterStemmer, and store the result in the list psw.
Below is my code:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

pattern = r'\w+'                            # alphabets, numbers, underscore
tw = nltk.regexp_tokenize(tc, pattern)      # step 1: tokenize tc
tw = [word.lower() for word in tw]          # step 2: lowercase
stop_word = set(stopwords.words('english'))
fw = [w for w in tw if w not in stop_word]  # step 3: remove stop words
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]    # step 4: stem
print(sorted(psw))
My code works perfectly with all the provided test cases in the hands-on, but it fails only for the test case below, where
tc = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."
My output is:
['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']
Expected output is:
['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']
The difference is the number of occurrences of 'candi'.
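For what it's worth, a multiset diff confirms this is the only difference (a quick sketch; my_output and expected are stand-in names for the two lists above):
from collections import Counter
# Entries in the expected list that are missing from my output.
print(Counter(expected) - Counter(my_output))
[out]:
Counter({'candi': 1})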
Looking for help troubleshooting the issue.
Answer 1:
First, don't iterate through the text multiple times; see Why is my NLTK function slow when processing the DataFrame?
Do this instead, so you only iterate through your data/text once:
from nltk import regexp_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_word = set(stopwords.words('english'))
porter = PorterStemmer()

text = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."

# Tokenize, lowercase, filter stop words, and stem in a single pass.
signature = [porter.stem(word.lower())
             for word in regexp_tokenize(text, r'\w+')
             if word.lower() not in stop_word]
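To reproduce the question's output format with this one-pass version, sort and print the result (a usage sketch):
# Sorted stems of the non-stopword tokens, same shape as psw in the question.
print(sorted(signature))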
Next, let's check signature against the expected output:
expected = ['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']
sorted(signature) == expected # -> False
[out]:
False
That's not a good sign; let's find which terms are missing:
# If item in signature but not in expected.
len(set(signature).difference(expected)) == 0 # -> True
# If item in expected but not in signature.
len(set(expected).difference(signature)) == 0 # -> True
In that case, let's check the counts:
print(len(signature), len(expected))
[out]:
57 49
It seems your expected output is missing quite a few items. Checking through:
from collections import Counter
counter_signature = Counter(signature)
counter_expected = Counter(expected)
for word, count in counter_signature.items():
    # If the count in expected is different.
    expected_count = counter_expected[word]
    if count != expected_count:
        print(word, count, expected_count)
It turns out that not only candi has a different count!
[out]:
see 3 1
candi 5 3
dollar 3 1
two 2 1
chocol 2 1
It looks like the signature (i.e. the processed text) contains quite a few more tokens than the expected output in the question. So most probably the test you have is not counting things correctly =)
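As a sanity check (an addition, not from the original answer), counting the surface forms directly in the raw text supports the processed counts, since candy/candies together account for the five candi stems:
import re
# Count the surface forms in the raw text that stem to the disputed entries.
# "candy" covers the three occurrences of See's Candy plus dollar-per-candy.
for form in ("see", "candy", "candies", "dollar", "two", "chocolate"):
    print(form, len(re.findall(r"\b" + form + r"\b", text.lower())))
[out]:
see 3
candy 4
candies 1
dollar 3
two 2
chocolate 2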
Answer 2:
Try using:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

pattern = r'\w+'
tw = nltk.regexp_tokenize(tc, pattern)             # step 1: tokenize tc
tw = [word.lower() for word in tw]                 # step 2: lowercase
unique_tw = set(tw)                                # unique set of tokenized words (see step 3)
stop_word = set(stopwords.words('english'))
fw = [w for w in unique_tw if w not in stop_word]  # remove stop words from unique_tw
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]           # step 4: stem
print(sorted(psw))
This is because step 3 says: Remove all the stop words from the unique set of tw.
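One caveat worth noting (an addition, not from the original answer): taking the unique set before stemming does not guarantee unique stems, because different surface forms can reduce to the same stem; this is why candi can still appear more than once in psw:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
# Distinct tokens can share a stem, so psw may still contain duplicates.
print(porter.stem("candy"), porter.stem("candies"))
[out]:
candi candi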
Source: https://stackoverflow.com/questions/62626878/why-is-the-number-of-stem-from-nltk-stemmer-outputs-different-from-expected-outp