Technique to remove common words (and their plural versions) from a string


I am attempting to find tags (keywords) for a recipe by parsing a long string of text. The text contains the recipe ingredients, directions, and a short blurb.

Wha


3 Answers

青春惊慌失措

2021-02-05 08:54

Your problem domain is "Natural Language Processing".

If you don't want to reinvent the wheel, use NLTK and search for stemming in the docs.

Given that NLP is one of the hardest subjects in computer science, reinventing this wheel is a lot of work...


情深已故

2021-02-05 08:56

I'd just do something like this:

    from nltk.corpus import stopwords  # needs the NLTK 'stopwords' corpus installed

    s = set(stopwords.words('english'))
    txt = "a long string of text about him and her"
    # Keep only the words that are not stopwords
    print([w for w in txt.split() if w not in s])

which prints

['long', 'string', 'text']

and, in terms of complexity, it should be O(n) in the number of words in the string, if you believe the hashed set lookup is O(1).

FWIW, my version of NLTK defines 127 stopwords:

'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'

Obviously you can provide your own set. I agree with the comment on your question that it's probably easiest (and fastest) to just provide all the variations you want to eliminate up front. If you want to eliminate a lot more words than this, though, it becomes more a question of spotting interesting words than eliminating spurious ones.
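To make the "provide all the variations up front" idea concrete, here is a minimal sketch using only the standard library. The word list and the helper names (`BASE_WORDS`, `with_plurals`, `remove_common`) are my own illustrations, not part of NLTK, and the pluralization is deliberately naive (it just appends "s" and "es").

```python
# Hypothetical base words to drop; pick whatever suits your recipes.
BASE_WORDS = {"cup", "teaspoon", "tablespoon", "minute", "the", "and", "of", "a"}

def with_plurals(words):
    """Return the words plus naive plural forms (word + 's', word + 'es')."""
    expanded = set(words)
    for w in words:
        expanded.add(w + "s")
        expanded.add(w + "es")
    return expanded

# Build the full set once, so each later lookup is a single hash check.
STOPWORDS = with_plurals(BASE_WORDS)

def remove_common(text, stopwords=STOPWORDS):
    """Keep only the words that are not in the stopword set (case-insensitive)."""
    return [w for w in text.lower().split() if w not in stopwords]

print(remove_common("Add 2 cups of flour and a teaspoon of salt"))
# ['add', '2', 'flour', 'salt']
```

Irregular plurals ("leaves", "loaves") would slip past this, so you would have to list them by hand or switch to a real stemmer.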


小鲜肉

2021-02-05 09:06

You ask about speed, but you should be more concerned with accuracy. Both of your suggestions will make a lot of mistakes, removing either too much or too little (for example, there are a lot of words that contain the substring "at"). I second the suggestion to look into NLTK. In fact, one of the early examples in the NLTK book involves removing common words until the most common remaining ones reveal something about the genre. You'll get not only tools, but also instruction on how to go about it.

Anyway, you'll spend much longer writing your program than your computer will spend executing it, so focus on doing it well.
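To illustrate the "at" problem, here is a rough sketch that contrasts substring removal with whole-word filtering, then counts what is left to suggest tags. It uses only the standard library; the `COMMON` set and the function names are made up for this example, and a real stopword list would be much longer.

```python
import re
from collections import Counter

# Illustrative stopword set only; not a real NLTK list.
COMMON = {"at", "the", "a", "and", "of", "in", "to", "for", "until"}

def strip_substrings(text, words):
    """Naive approach: delete every occurrence, even inside other words."""
    for w in words:
        text = text.replace(w, "")
    return text

def top_tags(text, common=COMMON, n=3):
    """Split into whole words, drop the common ones, and count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in common)
    return [word for word, _ in counts.most_common(n)]

# Substring removal mangles words that merely contain "at".
print(strip_substrings("heat the tomato", ["at"]))  # he the tomo

recipe = "Heat the oil, add the tomato and the potato, then simmer the tomato sauce"
print(top_tags(recipe, n=1))  # ['tomato']
```

Matching whole tokens against a set avoids the mangling, and the remaining word counts give you candidate tags.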
