Technique to remove common words (and their plural versions) from a string

悲哀的现实 2021-02-05 08:17

I am attempting to find tags (keywords) for a recipe by parsing a long string of text. The text contains the recipe ingredients, directions, and a short blurb.

Wha

3 Answers
  • 2021-02-05 08:54

    Your problem domain is "Natural Language Processing".

    If you don't want to reinvent the wheel, use NLTK and search for stemming in the docs.

    Given that NLP is one of the hardest subjects in computer science, reinventing this wheel is a lot of work...
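
To give a feel for what stemming buys you, here is a rough sketch that folds plural forms onto a common key. `naive_singular` is a hypothetical helper written for this illustration; it is a crude stand-in, not NLTK's stemming algorithm, and the suffix rules are assumptions that only cover regular English plurals.

```python
def naive_singular(word):
    """Crude plural stripper: a stand-in for a real stemmer, not NLTK's algorithm."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"   # berries -> berry
    if len(w) > 4 and w.endswith(("ches", "shes", "sses", "xes", "zes")):
        return w[:-2]         # dishes -> dish
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]         # cups -> cup
    return w                  # leave everything else untouched


words = ["Cups", "berries", "dishes", "glass", "onions", "couscous"]
print([naive_singular(w) for w in words])
# ['cup', 'berry', 'dish', 'glass', 'onion', 'couscous']
```

Rules like these break quickly ("tomatoes" comes out as "tomatoe"), which is a good argument for using a tested stemmer rather than growing your own list of special cases.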

  • 2021-02-05 08:56

    I'd just do something like this:

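    from nltk.corpus import stopwords

    # Needs the NLTK stopwords corpus: nltk.download('stopwords')
    s = set(stopwords.words('english'))
    txt = "a long string of text about him and her"
    # Keep only the words that are not in the stopword set (Python 3 syntax).
    print([w for w in txt.split() if w not in s])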
    

    which prints

    ['long', 'string', 'text']
    

    and in terms of complexity this should be O(n) in the number of words in the string, assuming the hashed set lookup is O(1).

    FWIW, my version of NLTK defines 127 stopwords:

    'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'
    

    Obviously you can provide your own set. I agree with the comment on your question that it's probably easiest (and fastest) to just provide all the variations you want to eliminate up front. If you want to eliminate a lot more words than this, though, it becomes more a question of spotting interesting words than of eliminating spurious ones.
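
As a minimal sketch of the "provide all the variations up front" idea, using only the standard library: the word lists and the `with_plurals` and `remove_common` helpers below are made up for illustration, and the "-s"/"-es" rule is a deliberate simplification, so irregular plurals would have to be listed by hand.

```python
def with_plurals(nouns):
    """Return the nouns plus naive '-s'/'-es' plurals (a simplification, not a linguistic rule)."""
    expanded = set()
    for w in nouns:
        expanded.add(w)
        expanded.add(w + "es" if w.endswith(("s", "x", "ch", "sh")) else w + "s")
    return expanded


# Hypothetical recipe words to drop, on top of a few general stopwords.
UNITS = with_plurals({"cup", "teaspoon", "tablespoon", "ounce", "pinch", "dash"})
COMMON = {"the", "of", "and", "a"} | UNITS


def remove_common(text, common=COMMON):
    # One pass over the words, one set lookup per word.
    return [w for w in text.lower().split() if w not in common]


print(remove_common("2 cups of flour and a pinch of salt"))
# ['2', 'flour', 'salt']
```

Building the set once keeps the per-word work to a single hash lookup, so the filter stays linear in the number of words.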

  • 2021-02-05 09:06

    You ask about speed, but you should be more concerned with accuracy. Both your suggestions will make a lot of mistakes, removing either too much or too little (for example, there are a lot of words that contain the substring "at"). I second the suggestion to look into the nltk module. In fact, one of the early examples in the NLTK book involves removing common words until the most common remaining ones reveal something about the genre. You'll get not only tools, but instruction on how to go about it.
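
The "at" problem is easy to reproduce. This small sketch (the sample sentence is invented) compares plain substring removal with a whole-word match from the standard `re` module:

```python
import re

text = "Heat the water and add the tomatoes at once"

# Substring removal also deletes "at" inside Heat, water and tomatoes.
naive = text.replace("at", "")

# \b anchors the match at word boundaries, so only the standalone word goes.
whole = re.sub(r"\bat\b\s*", "", text)

print(naive)
print(whole)  # Heat the water and add the tomatoes once
```

Splitting the text into words and testing each one against a set avoids the same mistake and is usually simpler than a regex per stopword.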

    Anyway, you'll spend much longer writing your program than your computer will spend executing it, so focus on doing it well.
