Technique to remove common words (and their plural versions) from a string

悲哀的现实 2021-02-05 08:17

I am attempting to find tags (keywords) for a recipe by parsing a long string of text. The text contains the recipe ingredients, directions, and a short blurb.

What would be a good technique to remove the common words (and their plural versions) from this string?

3 Answers
  •  情深已故
    2021-02-05 08:56

    I'd just do something like this:

    from nltk.corpus import stopwords

    # Requires the stopword corpus: nltk.download('stopwords')
    s = set(stopwords.words('english'))

    txt = "a long string of text about him and her"
    print([w for w in txt.split() if w not in s])
    

    which prints

    ['long', 'string', 'text']
    

    In terms of complexity, this should be O(n) in the number of words in the string, if you believe the hashed set lookup is O(1).

    FWIW, my version of NLTK defines 127 stopwords:

    'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'
    

    Obviously you can provide your own set; I agree with the comment on your question that it's probably easiest (and fastest) to provide all the variations you want to eliminate up front. If you want to eliminate many more words than this, though, it becomes more a question of spotting interesting words than of eliminating spurious ones.
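    To also catch the plural versions mentioned in the title, one option is to lemmatize each word before the stopword check. Here is a minimal sketch, assuming NLTK's WordNetLemmatizer is available (it needs the wordnet corpus via nltk.download('wordnet')); the sample text is made up:

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words('english'))

    txt = "two cups of onions and three tomatoes"
    # Lemmatize as nouns so plural forms collapse to their singular form,
    # then drop anything that is a stopword.
    words = [lemmatizer.lemmatize(w.lower()) for w in txt.split()]
    print([w for w in words if w not in stop])

    which should print something like

    ['two', 'cup', 'onion', 'three', 'tomato']

    You could equally keep a purely hand-built set with the singular and plural forms spelled out if the vocabulary is small, which matches the "provide all the variations up front" advice above.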
