Question
I have the following pandas DataFrame:
Question_ID | Customer_ID | Answer
1 | 234 | The team worked very hard ...
2 | 234 | All the teams have been working together ...
I want to count the words in the Answer column. Before counting, I want to strip the plural "s" from words such as "teams", so that the example above gives team: 2 instead of team: 1 and teams: 1.
How can I do this for all words?
Answer 1:
You need a tokenizer (to break a sentence into words) and a lemmatizer (to standardize word forms). Both are provided by the Natural Language Toolkit, nltk:
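import nltk
# The lemmatizer needs the WordNet data: run nltk.download('wordnet') once.
sentence = "All the teams have been working together"
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(word) for word in nltk.wordpunct_tokenize(sentence)]
# ['All', 'the', 'team', 'have', 'been', 'working', 'together']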
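To connect the lemmatizer to the word count the question asks for, here is a minimal sketch rather than a definitive implementation. The helper names count_words and wordnet_normalizer are hypothetical, the regex tokenizer is a simple stand-in for nltk.wordpunct_tokenize, and the lemmatizing path assumes nltk is installed with the WordNet data downloaded.

```python
from collections import Counter
import re

import pandas as pd


def wordnet_normalizer():
    """Return a function mapping a token to its lowercase WordNet lemma.

    Assumption: nltk is installed and nltk.download('wordnet') has been run.
    """
    import nltk  # imported lazily so count_words also works without nltk

    wnl = nltk.WordNetLemmatizer()
    return lambda token: wnl.lemmatize(token.lower())


def count_words(answers, normalize=str.lower):
    """Count normalized alphabetic tokens across a Series of strings."""
    counts = Counter()
    for text in answers.dropna():
        # Simple alphabetic tokenizer (hypothetical stand-in for an nltk tokenizer).
        counts.update(normalize(tok) for tok in re.findall(r"[A-Za-z]+", text))
    return counts


df = pd.DataFrame({
    "Question_ID": [1, 2],
    "Customer_ID": [234, 234],
    "Answer": ["The team worked very hard",
               "All the teams have been working together"],
})

# Lowercasing only: "team" and "teams" are still counted separately.
baseline = count_words(df["Answer"])
```

With nltk available, count_words(df["Answer"], wordnet_normalizer()) should merge "teams" into "team". Passing the normalizer as an argument keeps the counting logic independent of whichever lemmatizer or stemmer you settle on.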
Answer 2:
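Use str.replace to remove the trailing 's' from any word of three or more letters that ends in 's'. In pandas 2.0 and later, pass regex=True, because the pattern is otherwise treated as a literal string.
df.Answer.str.replace(r'(\w{2,})s\b', r'\1', regex=True)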
0 The team worked very hard ...
1 All the team have been working together ...
Name: Answer, dtype: object
'{2,}' specifies two or more word characters. Combined with the trailing 's', that ensures you skip 'is'. You can set it to '{3,}' to skip 'its' as well.
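The effect of the quantifier can be checked with a small sketch. The helper name strip_plural_s and its min_stem parameter are hypothetical, and the third sentence is made up only to show edge cases; note that the regex also clips non-plural words such as "This" and "class".

```python
import pandas as pd


def strip_plural_s(series, min_stem=2):
    """Drop a trailing 's' from words with at least min_stem characters before it."""
    pattern = rf"(\w{{{min_stem},}})s\b"
    return series.str.replace(pattern, r"\1", regex=True)


answers = pd.Series([
    "The team worked very hard",
    "All the teams have been working together",
])

cleaned = strip_plural_s(answers)

# One row per word, lowercased, then counted.
word_counts = cleaned.str.lower().str.split().explode().value_counts()

# Edge cases: min_stem=2 turns "its" into "it", min_stem=3 leaves it alone.
edge = pd.Series(["This is its class"])
edge_two = strip_plural_s(edge, min_stem=2).iloc[0]    # "Thi is it clas"
edge_three = strip_plural_s(edge, min_stem=3).iloc[0]  # "Thi is its clas"
```

Here word_counts["team"] is 2, which is the count the question asks for. The edge cases show why a purely regex-based approach needs either a longer minimum stem or an exception list for words that merely end in 's'.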
Answer 3:
Try the NLTK toolkit, specifically its stemming and lemmatization tools. I have never personally used it, but you can try it out.
Here is an example of some tricky plurals:
its it's his quizzes fishes maths mathematics
becomes
it it ' s hi quizz fish math mathemat
You can see it deals with "his" (and "mathematics") poorly, but then again you could have lots of abbreviated "hellos". This is the nature of the beast.
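The output above looks like the result of a suffix-stripping stemmer. The answer does not say which stemmer produced it, so the following is only a rough sketch that assumes nltk's PorterStemmer; the word list is taken from the example, with "teams" added from the question.

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed; no corpus download needed

stemmer = PorterStemmer()

words = "its his quizzes fishes maths mathematics teams".split()

# Map each word to its stem so the over-aggressive cases are easy to spot.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

A stemmer chops suffixes by rule and does not check that the result is a real word, which is why it tends to be cruder than the lemmatizer from Answer 1 on words like "his".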
Source: https://stackoverflow.com/questions/41227373/python-pandas-get-ride-of-plural-s-in-words-to-prepare-for-word-count