Auto Categorization of Content

Posted by 孤街浪徒 on 2019-12-22 09:55:17

Question


I'm developing a script that extracts the messages from the message archive of a meetup.com group I'm a member of: http://www.meetup.com/opencoffee/messages/archive/

The idea is to add these dynamically to a WordPress site and let people search the messages, auto-tag them, and so on.

The issue I have is how best to auto-categorize these messages. I would welcome any thoughts and ideas on how to go about this and on the most efficient way to program it.

Option 1

Find a source of tags by subject area (finance, technology, business, etc.) by using the Delicious API to find related tags for each subject:

http://delicious.com/tag/finance

http://delicious.com/tag/technology

If a message contains these tags, it is assigned to the respective category.

I believe this could work, but I'm not sure what the most efficient method is for scanning a message for these tags.
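
One way the scanning step in Option 1 could be sketched is to compile a single regular expression per category and count whole-word matches in each message. This is only a minimal illustration in Python, not a WordPress plugin: the `CATEGORY_TAGS` contents, the `categorize` function and the `min_hits` threshold are all hypothetical names and values chosen for the example, and the real tag lists would come from whichever tag source is used.

```python
import re
from collections import Counter

# Hypothetical category -> tag mapping (assumption for illustration only).
CATEGORY_TAGS = {
    "finance": {"investment", "funding", "bank", "venture capital"},
    "technology": {"software", "startup", "web", "api"},
}


def build_pattern(tags):
    """Compile one case-insensitive, whole-word regex for a set of tags."""
    # Longest tags first so multi-word tags are tried before shorter ones.
    ordered = sorted(tags, key=len, reverse=True)
    alternation = "|".join(re.escape(tag) for tag in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


# Compile once, then reuse for every message.
PATTERNS = {cat: build_pattern(tags) for cat, tags in CATEGORY_TAGS.items()}


def categorize(message, min_hits=1):
    """Return categories whose tags occur at least min_hits times,
    ordered from most to fewest tag occurrences."""
    scores = Counter()
    for category, pattern in PATTERNS.items():
        hits = len(pattern.findall(message))
        if hits >= min_hits:
            scores[category] = hits
    return [category for category, _ in scores.most_common()]


if __name__ == "__main__":
    print(categorize("We built a web API for our startup, seeking funding"))
```

With roughly 10 categories of 20 to 30 tags each and about 1,700 messages, a precompiled pattern per category should be more than fast enough; raising `min_hits` is one possible way to reduce false positives from a single incidental word.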

Option 2

Find sites that are representative of the categories I need (ft.com and The Economist for finance, TechCrunch for technology, etc.), then determine which tags people use to tag these sites. Assume by default that those tags reflect how people relate to these sites and their content.
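
The aggregation step in Option 2 might look something like the sketch below, assuming the tag counts for each representative site have already been collected somehow. The data in `SITE_TAGS`, the site-to-category mapping and the `category_tags` function are all hypothetical and exist only to show the idea: sum the tag counts per category, then drop tags that show up in more than one category because they do not help tell categories apart.

```python
from collections import Counter

# Hypothetical tag counts per representative site (made-up numbers).
SITE_TAGS = {
    "ft.com": Counter({"finance": 50, "news": 40, "markets": 30}),
    "economist.com": Counter({"economics": 45, "news": 35, "finance": 20}),
    "techcrunch.com": Counter({"startups": 60, "technology": 55, "news": 25}),
}

# Hypothetical assignment of each site to the category it represents.
SITE_CATEGORY = {
    "ft.com": "finance",
    "economist.com": "finance",
    "techcrunch.com": "technology",
}


def category_tags(site_tags, site_category, top_n=20):
    """Return the top_n most-used tags per category, excluding tags
    that appear in more than one category."""
    totals = {}
    for site, tags in site_tags.items():
        totals.setdefault(site_category[site], Counter()).update(tags)

    # Count in how many categories each tag appears.
    seen = Counter(tag for counts in totals.values() for tag in counts)

    return {
        category: [t for t, _ in counts.most_common() if seen[t] == 1][:top_n]
        for category, counts in totals.items()
    }


if __name__ == "__main__":
    print(category_tags(SITE_TAGS, SITE_CATEGORY))
```

The resulting per-category tag lists could then be used as the tag source for the keyword scanning described in Option 1.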

Option 3

Pass the message URL to http://semanticproxy.com/ (part of the Reuters Calais project) or use the Open Calais API. I have tried this, but without much success: the messages vary in length and depth, and many do not contain enough content to return a meaningful taxonomy.

Here is an example message that I passed through the Calais API:

Original Message

http://www.meetup.com/opencoffee/messages/6045615/

Calais Result

http://www.mashinteractive.com/opencoffee/calais.php

Summary

That's about it. I would welcome any thoughts and ideas on methodology, and tips on how best to approach the message scanning for options 1 and 2.

FYI, there are approximately 1,700 messages to date, and I'm guessing I may have 10 categories, with each category defined by 20 or 30 tags.

If anyone would like to help develop a WordPress plugin or class to do this, I would be more than happy to have you on board. Bear in mind I'm not a programmer; I just tinker around the edges and pretend I am one.

Thanks in advance

Jonathan CEO

Crowd People


Answer 1:


You may want to check out Zemanta, which has tools and plugins (including one for WordPress) for auto-tagging content. Also have a look at Common Tag, a vocabulary for expressing tags on content using RDFa, a semantic web standard currently indexed by some search engines.



Source: https://stackoverflow.com/questions/820296/auto-categorization-of-content
