Question
I do not have a formal background in Natural Language Processing and was wondering if someone from the NLP side could shed some light on this. I am playing around with the NLTK library, and I was specifically looking into the stopwords list provided by this package:
In [80]: nltk.corpus.stopwords.words('english')
Out[80]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
What I don't understand is, why is the word "not" present? Isn't that necessary to determine the sentiment inside a sentence? For instance, a sentence like this:
I am not sure what the problem is.
is totally different once the stopword "not" is removed, which changes the meaning of the sentence to its opposite ("I am sure what the problem is"). If that is the case, is there a set of rules that I am missing on when not to use these stopwords?
Answer 1:
The concept of a stop word list does not have a universal meaning; it depends on what you want to do. If you have a task where you need to understand the polarity, sentiment or a similar characteristic of a phrase, and your method depends on detecting negation (as in your example), you obviously shouldn't remove "not" as a stop word. Note that you may still want to remove other very common, unrelated words, which would then constitute your new stop word list.
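To make that concrete, here is a minimal sketch of a task-specific stop word list that keeps negations. It assumes a small hand-copied subset of the NLTK list from the question rather than the real corpus, and the names `BASE_STOPWORDS`, `NEGATIONS` and `remove_stopwords` are made up for illustration:

```python
# A small subset of the NLTK English list shown in the question (assumption:
# in practice you would start from nltk.corpus.stopwords.words('english')).
BASE_STOPWORDS = {"i", "am", "is", "the", "what", "not", "no", "nor", "a", "an"}

# Words that carry negation and should survive filtering for sentiment work.
NEGATIONS = {"not", "no", "nor"}

# Task-specific stop word list: the base list minus the negation words.
SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS


def remove_stopwords(text, stopwords):
    """Lowercase, strip periods, split on whitespace, drop tokens in `stopwords`."""
    tokens = text.lower().replace(".", "").split()
    return [tok for tok in tokens if tok not in stopwords]


sentence = "I am not sure what the problem is."
print(remove_stopwords(sentence, BASE_STOPWORDS))       # negation is lost
print(remove_stopwords(sentence, SENTIMENT_STOPWORDS))  # negation is kept
```

The whitespace tokenizer is deliberately crude; a real pipeline would probably use a proper tokenizer, but the set difference is the part that matters here.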
However, to answer your question: most sentiment analysis methods are very superficial. They look for emotion- or sentiment-laden words and, most of the time, do not attempt a deep analysis of the sentence.
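As a rough illustration of what "superficial" means here, the following sketch scores a sentence by summing values from a toy word lexicon. The lexicon, its values, and the function name `lexicon_score` are all hypothetical; this is not how any particular library works, only the general word-spotting idea:

```python
# Toy sentiment lexicon (hypothetical values, for illustration only).
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "awful": -1, "sad": -1}


def lexicon_score(text):
    """Sum the lexicon values of the words in `text`; unknown words count as 0."""
    words = text.lower().replace(".", "").split()
    return sum(LEXICON.get(word, 0) for word in words)


print(lexicon_score("The movie was good."))      # 1
print(lexicon_score("The movie was not good."))  # also 1: "not" is ignored
```

A scorer like this gives the same result whether or not "not" is in the stop word list, which is one reason the word can end up on a general-purpose list without anyone noticing.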
As another example where you would want to keep the stop words: if you are trying to classify documents according to their authors (authorship attribution) or carrying out stylometrics, you should definitely keep these functional words, as they characterize a big part of the style and the discourse.
However, for many other kinds of analyses (e.g. word space models, document similarity, search, etc.) removing very common, functional words makes sense both computationally (you process fewer words) and in some cases practically (you may even get better results with the stop words removed). If I'm trying to understand the context in which a specific word is used very often, I'd like to see the content words, not the functional words.
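A small sketch of that last point, counting the words that co-occur with a target word while skipping stop words. The three-sentence corpus, the tiny `STOPWORDS` set and the helper `context_counts` are invented for this example:

```python
from collections import Counter

# Hypothetical mini stop word list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "in", "is", "to", "and", "on"}


def context_counts(sentences, target, stopwords):
    """Count words sharing a sentence with `target`, skipping stop words."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target in tokens:
            counts.update(t for t in tokens if t != target and t not in stopwords)
    return counts


corpus = [
    "the bank raised the interest rate",
    "the bank of the river is muddy",
    "interest in the bank is growing",
]
print(context_counts(corpus, "bank", STOPWORDS).most_common(3))
```

Without the filter, "the" would be the most frequent context word for "bank", which says nothing about how the word is used.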
Source: https://stackoverflow.com/questions/6482046/why-are-these-words-considered-stopwords