Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text.
The algorithm is just the one you described. As for a program that does it out of the box, with a big button saying "Do it"... I don't know of one.
But let me be constructive: I recommend the book Programming Collective Intelligence. Chapters 3 and 4 contain very pragmatic examples (really, no complex theory, just examples).
Here is an example of how you might do that in Python; the concepts are similar in any language.
>>> import urllib2, string
>>> devilsdict = urllib2.urlopen('http://www.gutenberg.org/files/972/972.txt').read()
>>> workinglist = devilsdict.split()
>>> cleanlist = [item.strip(string.punctuation) for item in workinglist]
>>> results = {}
>>> skip = {'a':'', 'the':'', 'an':''}
>>> for item in cleanlist:
...     if item not in skip:
...         try:
...             results[item] += 1
...         except KeyError:
...             results[item] = 1
...
>>> results
{'': 17, 'writings': 3, 'foul': 1, 'Sugar': 1, 'four': 8, 'Does': 1, "friend's": 1, 'hanging': 4, 'Until': 1, 'marching': 2 ...
The first line just imports libraries that help with parts of the problem, as in the second line, where urllib2 downloads a copy of Ambrose Bierce's "Devil's Dictionary". The next lines make a list of all the words in the text, without punctuation. Then you create a hash table, which in this case is like a list of unique words, each associated with a number. The for loop goes over each word in the Bierce book: if there is already a record of that word in the table, each new occurrence adds one to the value associated with that word; if the word hasn't appeared yet, it gets added to the table with a value of 1 (meaning one occurrence). For the cases you are talking about, you would want to pay much more attention to detail, for example using capitalization to help identify proper nouns only in the middle of sentences, and so on; this is very rough but expresses the concept.
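As an aside, modern Python (3.x) ships this counting pattern in the standard library; here is a minimal equivalent sketch using collections.Counter (urllib2 became urllib.request in Python 3):

import string
import urllib.request
from collections import Counter

# Fetch the same text; Python 3's urllib.request replaces urllib2.
text = urllib.request.urlopen('http://www.gutenberg.org/files/972/972.txt').read().decode('utf-8', errors='ignore')

# Strip punctuation from each whitespace-separated token and skip stop words.
skip = {'a', 'the', 'an'}
words = (w.strip(string.punctuation) for w in text.split())
results = Counter(w for w in words if w and w not in skip)

print(results.most_common(10))  # the ten most frequent words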
To get into the stemming and pluralization stuff, experiment, then look into third-party work. I have enjoyed parts of NLTK, an academic open-source project, also in Python.
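For instance, a minimal sketch of stemming with NLTK (assuming pip install nltk):

from nltk.stem import PorterStemmer

# The Porter stemmer collapses inflected forms onto a common stem,
# so singular and plural forms can be counted together.
stemmer = PorterStemmer()
for word in ['algorithms', 'counting', 'occurs', 'occurred']:
    print(word, '->', stemmer.stem(word))
# e.g. 'algorithms' -> 'algorithm', 'counting' -> 'count'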
You can use the WordNet dictionary to get basic information about the question keyword, such as its part of speech, and to extract synonyms. You can do the same for your document to create an index for it. Then you can easily match the keyword against the index file, rank the documents, and summarize them.
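If you want to experiment with that idea, WordNet also ships with NLTK; a rough sketch (assuming nltk.download('wordnet') has been run):

from nltk.corpus import wordnet

# Each synset for a word carries a part-of-speech tag and a list of
# synonymous lemmas.
for synset in wordnet.synsets('frequency'):
    print(synset.pos(), synset.lemma_names())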
Everything you have listed is handled well by spaCy.
If the list of topics is pre-determined and not huge, you may even go further: build a classification model that will predict the topic. Say you have 10 subjects. You collect sample sentences or texts and load them into another product, Prodigy. Using its great interface, you quickly assign subjects to the samples. Finally, using the categorized samples, you train a spaCy model to predict the subject of the texts or sentences.
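As a rough sketch of the tagging-and-counting part with spaCy (assuming the small English model has been installed with python -m spacy download en_core_web_sm):

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')
doc = nlp("Without getting a degree in information retrieval, I'd like to "
          "know if there exists any algorithms for counting the frequency "
          "that words occur in a given body of text.")

# Count the lemmas of content words only, skipping determiners,
# pronouns, punctuation, and other closed-class categories.
counts = Counter(token.lemma_ for token in doc
                 if token.pos_ in ('NOUN', 'VERB', 'ADJ'))
print(counts.most_common())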
I wrote a full program to do just this a while back. I can upload a demo later when I get home.
Here is the code (asp.net/c#): http://naspinski.net/post/Findingcounting-Keywords-out-of-a-Text-Document.aspx
Welcome to the world of NLP ^_^
All you need is a little basic knowledge and some tools.
There are already tools that will tell you whether a word in a sentence is a noun, adjective, or verb. They are called part-of-speech taggers. Typically, they take plain-text English as input and output each word, its part of speech, and its base form. Here is the output of a popular UNIX part-of-speech tagger on the first sentence of your post:
$ echo "Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text." | tree-tagger-english
# Word  POS  lemma (base form)
Without IN without
getting VVG get
a DT a
degree NN degree
in IN in
information NN information
retrieval NN retrieval
, , ,
I PP I
'd MD will
like VV like
to TO to
know VV know
if IN if
there EX there
exists VVZ exist
any DT any
algorithms NNS algorithm
for IN for
counting VVG count
the DT the
frequency NN frequency
that IN/that that
words NNS word
occur VVP occur
in IN in
a DT a
given VVN give
body NN body
of IN of
text NN text
. SENT .
As you can see, it identified "algorithms" as the plural form (NNS) of "algorithm" and "exists" as a conjugation (VVZ) of "exist." It also identified "a" and "the" as determiners (DT), another word for articles, and it tokenized the punctuation along the way.
To do everything but the last point on your list, you just need to run the text through a POS tagger, filter out the categories that don't interest you (determiners, pronouns, etc.), and count the frequencies of the base forms of the words; a sketch follows the list of taggers below.
Here are some popular POS taggers:
TreeTagger (binary only: Linux, Solaris, OS-X)
GENIA Tagger (C++: compile it yourself)
Stanford POS Tagger (Java)
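If you would rather prototype the filter-and-count step in a scripting language, here is a rough sketch using NLTK's tagger and lemmatizer as a stand-in for the taggers above (it assumes the NLTK data packages punkt, averaged_perceptron_tagger, and wordnet have been downloaded):

import nltk
from collections import Counter
from nltk.stem import WordNetLemmatizer

text = ("Without getting a degree in information retrieval, I'd like to know "
        "if there exists any algorithms for counting the frequency that words "
        "occur in a given body of text.")

# Tag each token with its Penn Treebank part of speech.
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Keep only open-class words (nouns, verbs, adjectives, adverbs)
# and count their base forms.
lemmatizer = WordNetLemmatizer()
keep = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}
counts = Counter(lemmatizer.lemmatize(word.lower(), keep[tag[:2]])
                 for word, tag in tagged if tag[:2] in keep)
print(counts.most_common())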
To do the last thing on your list, you need more than just word-level information; an easy way to start is by counting sequences of words rather than just the words themselves. These are called n-grams. A good place to start is UNIX for Poets. If you are willing to invest in a book on NLP, I would recommend Foundations of Statistical Natural Language Processing.
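Counting n-grams needs nothing fancy; a minimal bigram sketch:

from collections import Counter

words = "the quick brown fox jumps over the lazy dog".split()

# A bigram is each pair of adjacent words; zipping the list against
# itself shifted by one produces them.
counts = Counter(zip(words, words[1:]))
print(counts.most_common(3))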