Technique to remove common words (and their plural versions) from a string

悲哀的现实 2021-02-05 08:17

I am attempting to find tags (keywords) for a recipe by parsing a long string of text. The text contains the recipe ingredients, directions, and a short blurb.

Wha

3 answers
  •  小鲜肉
    小鲜肉 (OP)
    2021-02-05 09:06

    You ask about speed, but you should be more concerned with accuracy. Both of your suggestions will make a lot of mistakes, removing either too much or too little (for example, many words contain the substring "at"). I second the suggestion to look into the nltk module. In fact, one of the early examples in the NLTK book involves removing common words until the most common remaining ones reveal something about the genre. You'll get not only the tools but also instruction on how to go about it.

    Anyway, you'll spend much longer writing your program than your computer will spend executing it, so focus on doing it well.
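To illustrate the accuracy point, here is a minimal sketch using only the standard library. The stop-word set, the `singularize` helper, and the sample text are all hypothetical and exist only for this example; in practice you would probably start from a fuller list such as NLTK's English stop-word corpus and a real stemmer or lemmatizer. The sketch contrasts substring removal with removal of whole tokens.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the", "to", "with",
              "cup", "minute"}


def singularize(word):
    """Naive plural stripping. An assumption for this sketch, not a real
    stemmer: it will mangle words such as "this" or "bus"."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def keywords(text):
    """Count whole-word tokens that are not stop words, merging plurals."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        base = singularize(token)
        # Drop the token if either its raw or singular form is a stop word.
        if token in STOP_WORDS or base in STOP_WORDS:
            continue
        counts[base] += 1
    return counts


text = "Heat the oats at low heat. Add 2 cups of oats and bake 10 minutes."

# Substring removal damages unrelated words: "oats" and "heat" lose "at".
naive = text.replace("at", "")

# Token-based removal only drops complete words.
counts = keywords(text)
print(naive)
print(counts.most_common(3))
```

Matching whole tokens against a set also keeps each lookup cheap, so the accurate approach does not have to be the slow one.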
