Question
Where could I find an exhaustive list of stop words? The one I have is quite short and it seems to be inapplicable to scientific texts.
I am creating lexical chains to extract key topics from scientific papers. The problem is that words like "based", "regarding", etc. should also be treated as stop words, since they carry little meaning.
Answer 1:
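You can also easily extend an existing stop word list. For example, start from the one in the NLTK toolkit (the corpus has to be downloaded once with nltk.download('stopwords')):
from nltk.corpus import stopwords
and then add whatever you think is missing, using a new variable name so that the imported stopwords module is not overwritten:
stop_words = stopwords.words('english') + ["based", "regarding"]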
The original NLTK list is described here.
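As a rough sketch of how the extended list might then be used to filter tokens: the short `base_stop_words` list below is only a stand-in for `stopwords.words('english')` so that the example runs without the NLTK corpus, and the helper name `remove_stop_words` is made up for illustration.

```python
# Stand-in for stopwords.words('english'); the real NLTK list is much longer.
base_stop_words = ["the", "a", "an", "of", "to", "is", "are", "on", "in", "and"]

# Domain-specific additions for scientific text (the two words from the question).
extra_stop_words = ["based", "regarding"]

# A set gives fast membership tests and removes duplicates.
stop_words = set(base_stop_words) | set(extra_stop_words)


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, compared case-insensitively."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The method is based on lexical chains".split()
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['method', 'lexical', 'chains']
```

Lowercasing each token before the lookup matters here because the NLTK English list is all lowercase, so "The" would otherwise slip through.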
Answer 2:
It is difficult to find an exhaustive list of stop words, because a word that counts as a stop word in one domain may be an important word in another.
You could take a look at some existing lists of stop words:
http://blog.adlegant.com/how-to-install-nltk-corporastopwords/
http://www.lextek.com/manuals/onix/stopwords1.html
http://www.ranks.nl/stopwords
http://xpo6.com/list-of-english-stop-words/
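Since no single list is exhaustive, one option is to merge several of them into one set. The sketch below assumes each list has been saved as plain text with one word per line, and that lines starting with `#` are comments; both the format and the helper name `parse_stop_words` are assumptions for illustration, not something the sites above guarantee.

```python
def parse_stop_words(text):
    """Parse a one-word-per-line stop word list into a lowercase set.

    Blank lines and lines starting with '#' are skipped (assumed convention).
    """
    words = set()
    for line in text.splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


# Two tiny made-up lists standing in for downloaded files.
list_a = "a\nthe\nof\n"
list_b = "# sample list\nThe\nbased\n\nregarding\n"

# Set union merges the lists and drops duplicates such as "the"/"The".
merged = parse_stop_words(list_a) | parse_stop_words(list_b)
print(sorted(merged))  # ['a', 'based', 'of', 'regarding', 'the']
```

With real files, the text would come from something like `open(path, encoding="utf-8").read()` instead of the inline strings.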
Source: https://stackoverflow.com/questions/37701305/where-to-find-an-exhaustive-list-of-stop-words