I'm developing a script that extracts the messages from the message archive of a particular meetup.com group of which I'm a member - http://www.meetup.com/opencoffee/messages/archive/
The idea is to dynamically add these to a WordPress site and allow people to search messages, auto-tag messages, etc.
The issue I have is how best to auto-categorize these messages. I would welcome any thoughts and ideas on how best to go about this and on the most efficient way of programming it.
Option 1
Find a source of tags by subject area (such as finance, technology, business, etc.) by using the Delicious API to find related tags by subject:
http://delicious.com/tag/finance
http://delicious.com/tag/technology
If a message contains these tags, then the message is assigned to the respective category.
I believe this could work, but I'm not sure of the most efficient method of scanning the message for these tags.
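To make the scanning step concrete, here is a minimal sketch of one way it could work: tokenize each message once, then count how many words match each category's tag set. The `CATEGORY_TAGS` contents, the `categorize` function, and the `min_hits` threshold are all hypothetical names and values for illustration; the real tag lists would come from Delicious or whichever source is chosen.

```python
import re
from collections import Counter

# Hypothetical category -> tag sets, for illustration only.
# Real tag lists would be collected from Delicious or a similar source.
CATEGORY_TAGS = {
    "finance": {"finance", "investment", "funding", "bank"},
    "technology": {"technology", "software", "startup", "web"},
}

# Split a message into lowercase word tokens.
WORD_RE = re.compile(r"[a-z0-9']+")


def categorize(message, category_tags=CATEGORY_TAGS, min_hits=1):
    """Return matching categories, best match first.

    A category's score is the total number of times any of its tags
    appears as a whole word in the message.
    """
    words = Counter(WORD_RE.findall(message.lower()))
    scores = {}
    for category, tags in category_tags.items():
        hits = sum(words[tag] for tag in tags)  # missing words count as 0
        if hits >= min_hits:
            scores[category] = hits
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda c: (-scores[c], c))


if __name__ == "__main__":
    text = "Seeking investment, funding and a bank loan for our software"
    print(categorize(text))
```

Because each message is tokenized only once and each tag lookup is a dictionary access, the cost per message is roughly proportional to its length plus the total number of tags, which should be negligible for a couple of thousand messages. Note that this sketch only matches single-word tags exactly; multi-word tags, plurals, and stemming would need extra handling.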
Option 2
Find sites that are representative of the categories I need (such as ft.com and The Economist for finance, TechCrunch for technology, etc.), then determine which tags people use when bookmarking these sites, and assume by default that those tags reflect how people relate to these sites and their content.
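As a rough sketch of the second half of this option, suppose the tags people applied to a representative site have already been collected as one list per bookmark (how to fetch them is not shown here and would depend on the bookmarking service). The most frequent tags could then be taken as that category's vocabulary. The `top_tags` function, its parameters, and the sample data are hypothetical.

```python
from collections import Counter


def top_tags(bookmark_tags, n=20, stop_tags=frozenset()):
    """Return the n most frequent tags across all bookmarks of a site.

    bookmark_tags: iterable of tag lists, one list per bookmark.
    stop_tags: tags too generic to be useful (e.g. "news"), skipped.
    """
    counts = Counter(
        tag.lower()
        for tags in bookmark_tags
        for tag in tags
        if tag.lower() not in stop_tags
    )
    return [tag for tag, _ in counts.most_common(n)]


if __name__ == "__main__":
    # Made-up sample data standing in for tags on bookmarks of a finance site.
    finance_bookmarks = [
        ["Finance", "news"],
        ["finance", "markets"],
        ["markets", "finance"],
    ]
    print(top_tags(finance_bookmarks, n=2))
```

The resulting lists could feed straight into a tag-matching step like the one described under Option 1, with one list per category.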
Option 3
Pass the message URL to http://semanticproxy.com/ (part of the Reuters Calais project) or use the Open Calais API. I have tried this, but without much success, as the depth of content varies from message to message and is not always sufficient to return a meaningful taxonomy.
Here is an example message that I parsed through the Calais API:
Original Message
http://www.meetup.com/opencoffee/messages/6045615/
Calais Result
http://www.mashinteractive.com/opencoffee/calais.php
SUMMARY
So that's about it. I would welcome any thoughts and ideas on methodology, and tips on how best to approach the message scanning for options 1 and 2.
FYI, there are approximately 1,700 messages to date, and I'm guessing I may have 10 categories, with each category being defined by 20 or 30 tags.
If anyone would like to help develop a WordPress plugin or class to do this, I would be more than happy to have you on board. Bear in mind I'm not a programmer; I just tinker around the edges and pretend I am one.
Thanks in advance
Jonathan CEO
Crowd People
You may want to check out Zemanta, which has tools and plugins (including one for WordPress) for auto-tagging content. Also have a look at Common Tag, a vocabulary for expressing tags on content using RDFa, a semantic web standard currently indexed by some search engines.
Source: https://stackoverflow.com/questions/820296/auto-categorization-of-content