Word frequency algorithm for natural language processing

我在风中等你 2020-12-12 10:03

Without getting a degree in information retrieval, I'd like to know whether any algorithms exist for counting how often words occur in a given body of text. The

8 answers
  •  醉梦人生
    2020-12-12 10:15

    Here is an example of how you might do that in Python; the concepts are similar in any language.

    >>> import urllib2, string
    >>> devilsdict = urllib2.urlopen('http://www.gutenberg.org/files/972/972.txt').read()
    >>> workinglist = devilsdict.split()
    >>> cleanlist = [item.strip(string.punctuation) for item in workinglist]
    >>> results = {}
    >>> skip = {'a':'', 'the':'', 'an':''}
    >>> for item in cleanlist:
    ...     if item not in skip:
    ...         try:
    ...             results[item] += 1
    ...         except KeyError:
    ...             results[item] = 1
    ...
    >>> results
    {'': 17, 'writings': 3, 'foul': 1, 'Sugar': 1, 'four': 8, 'Does': 1, "friend's": 1, 'hanging': 4, 'Until': 1, 'marching': 2 ...
    

    The first line just imports libraries that help with parts of the problem; in the second line, urllib2 downloads a copy of Ambrose Bierce's "Devil's Dictionary". The next lines make a list of all the words in the text, with leading and trailing punctuation stripped. Then you create a hash table (a Python dict), which here maps each unique word to a number. The for loop goes over each word in the Bierce book, skipping the articles listed in skip: if the table already has a record of that word, each new occurrence adds one to its value; if the word hasn't appeared yet, it gets added to the table with a value of 1 (meaning one occurrence).

    For the cases you are talking about, you would want to pay much more attention to detail, for example treating a capitalized word as a likely proper noun only when it appears in the middle of a sentence. This is very rough, but it expresses the concept.
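For anyone on Python 3, the same counting idea can be sketched with collections.Counter from the standard library. This is only a rough adaptation, not the code from the answer: the function name word_frequencies, the SKIP set, and the sample sentence are made up for illustration, and it works on an in-memory string rather than downloading the book.

```python
import string
from collections import Counter

# Hypothetical stop-word set, mirroring the `skip` table in the answer above.
SKIP = {"a", "the", "an"}

def word_frequencies(text, skip=SKIP):
    """Count how often each word occurs in `text`, ignoring words in `skip`."""
    # Split on whitespace, then trim punctuation from both ends of each token.
    tokens = (raw.strip(string.punctuation) for raw in text.split())
    # Drop tokens that were pure punctuation (now empty) and skipped words.
    return Counter(tok for tok in tokens if tok and tok not in skip)

# Made-up sample text, used only to exercise the function.
sample = "The cat saw a dog; the dog saw the cat. An owl -- unseen -- saw both!"
counts = word_frequencies(sample)
print(counts.most_common(3))
```

Like the original, this is case-sensitive, so "The" and "the" are counted as different words; lowercasing each token first would merge them, at the cost of losing the capitalization hint for proper nouns. Filtering out empty tokens also avoids an entry like the '' key visible in the output above.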

    To get into the stemming and pluralization stuff, experiment first, then look into third-party work. I have enjoyed parts of NLTK, an academic open-source project that is also written in Python.
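As a starting point for that kind of experiment, here is a deliberately crude plural-folding sketch in plain Python. It is not a real stemmer and not how NLTK does it; crude_singular and its two suffix rules are invented for illustration, and they will mangle plenty of words (for example "this" or "buses"), which is exactly why a tested stemmer such as the ones shipped with NLTK is worth looking into afterwards.

```python
from collections import Counter

def crude_singular(word):
    """Very rough plural folding; NOT a real stemmer, illustration only."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"   # "stories" -> "story"
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]         # "words" -> "word"; "glass" is left alone
    return w

# Made-up word list, used only to show singular and plural forms merging.
words = ["Word", "words", "story", "stories", "glass", "bus"]
folded = Counter(crude_singular(w) for w in words)
print(folded)
```

Counting the folded forms instead of the raw tokens means "word" and "words" land in the same bucket, which is usually what you want from a frequency table.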
