Without getting a degree in information retrieval, I\'d like to know if there exists any algorithms for counting the frequency that words occur in a given body of text. The
You'll need not one, but several nice algorithms, along the lines of the following.
I'm sorry, I know you said you wanted to KISS, but unfortunately, your demands aren't that easy to meet. Nevertheless, there exist tools for all of this, and you should be able to just tie them together and not have to perform any task yourself, if you don't want to. If you want to perform a task yourself, I suggest you look at stemming, it's the easiest of all.
If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and a lot of tutorial. The OpenNLP toolkit on the other hand is poorly documented, but you won't need too much out of it. You might also be interested in NLTK, written in Python.
I would say you drop your last requirement, as it involves shallow parsing and will definetly not impove your results.
Ah, btw. the exact term of that document-term-frequency-thing you were looking for is called tf-idf. It's pretty much the best way to look for document frequency for terms. In order to do it properly, you won't get around using multidimenional vector matrices.
... Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some stuff in IR, my respect for them fell just as quick, though.
The first part of your question doesn't sound so bad. All you basically need to do is read each word from the file (or stream w/e) and place it into a prefix tree and each time you happen upon a word that already exists you increment the value associated with it. Of course you would have an ignore list of everything you'd like left out of your calculations as well.
If you use a prefix tree you ensure that to find any word is going to O(N) where N is the maximum length of a word in your data set. The advantage of a prefix tree in this situation is that if you want to look for plurals and stemming you can check in O(M+1) if that's even possible for the word, where M is the length of the word without stem or plurality (is that a word? hehe). Once you've built your prefix tree I would reanalyze it for the stems and such and condense it down so that the root word is what holds the results.
Upon searching you could have some simple rules in place to have the match return positive in case of the root or stem or what have you.
The second part seems extremely challenging. My naive inclination would be to hold separate results for adjective-subject groupings. Use the same principles as above but just keep it separate.
Another option for the semantic analysis could be modeling each sentence as a tree of subject, verb, etc relationships (Sentence has a subject and verb, subject has a noun and adjective, etc). Once you've broken all of your text up in this way it seems like it might be fairly easy to run through and get a quick count of the different appropriate pairings that occurred.
Just some ramblings, I'm sure there are better ideas, but I love thinking about this stuff.