Technique to remove common words (and their plural versions) from a string

悲哀的现实 2021-02-05 08:17

I am attempting to find tags (keywords) for a recipe by parsing a long string of text. The text contains the recipe ingredients, directions, and a short blurb.

What would be a good technique to remove the common words (and their plural versions) from this string?

3 Answers
  •  情深已故
    2021-02-05 08:56

    I'd just do something like this:

    from nltk.corpus import stopwords

    # Requires the stopword corpus: nltk.download('stopwords')
    s = set(stopwords.words('english'))

    txt = "a long string of text about him and her"
    print([w for w in txt.split() if w not in s])
    

    which prints

    ['long', 'string', 'text']
    

    In terms of complexity, this should be O(n) in the number of words in the string, if you believe the hashed set lookup is O(1).

    FWIW, my version of NLTK defines 127 stopwords:

    'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'
    

    Obviously you can provide your own set; I agree with the comment on your question that it's probably easiest (and fastest) to provide all the variations you want to eliminate up front. If you want to eliminate many more words than this, though, it becomes more a question of spotting interesting words than of eliminating spurious ones.
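    To also catch the plural versions mentioned in the title, one option is to lemmatize each word before the stopword check. Here is a minimal sketch, assuming NLTK's WordNetLemmatizer is available (it needs the wordnet corpus via nltk.download('wordnet')); the sample text is made up:

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words('english'))

    txt = "two cups of onions and three tomatoes"
    # Lemmatize as nouns so plural forms collapse to their singular form,
    # then drop anything that is a stopword.
    words = [lemmatizer.lemmatize(w.lower()) for w in txt.split()]
    print([w for w in words if w not in stop])

    which should print something like

    ['two', 'cup', 'onion', 'three', 'tomato']

    You could equally keep a purely hand-built set with the singular and plural forms spelled out if the vocabulary is small, which matches the "provide all the variations up front" advice above.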
