We were set an algorithm problem in class today, as an "if you figure out a solution, you don't have to do this subject" challenge. So of course, we all thought we'd give it a go.
Google is forbidden, but they have an almost perfect solution - Google Sets.
Because you need to understand the semantics of the words, you need external data sources. You could try using WordNet. Or you could try using Wikipedia - find the page for each word (or maybe only for each category) and look for other words appearing on the page or on linked pages.
First of all, you need sample text to analyze in order to get the relationships between words. A categorization approach using latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.
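This isn't LSA proper - the SVD/dimensionality-reduction step that makes it "latent" is omitted - but here's a minimal sketch of the raw co-occurrence vector space it starts from, with a made-up four-sentence corpus standing in for real sample text:

using System;
using System.Collections.Generic;
using System.Linq;

class VectorSpaceDemo {
    // Build a co-occurrence vector for each word: counts of the other
    // words that appear in the same sentence.
    static Dictionary<string, Dictionary<string, int>> BuildVectors(string[] sentences) {
        var vectors = new Dictionary<string, Dictionary<string, int>>();
        foreach (string sentence in sentences) {
            string[] tokens = sentence.ToLowerInvariant().Split(' ');
            foreach (string w in tokens) {
                if (!vectors.TryGetValue(w, out var vec))
                    vectors[w] = vec = new Dictionary<string, int>();
                foreach (string other in tokens)
                    if (other != w)
                        vec[other] = vec.TryGetValue(other, out int n) ? n + 1 : 1;
            }
        }
        return vectors;
    }

    // Cosine similarity between two sparse count vectors.
    static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b) {
        double dot = a.Keys.Intersect(b.Keys).Sum(k => (double)a[k] * b[k]);
        double na = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return na == 0 || nb == 0 ? 0 : dot / (na * nb);
    }

    static void Main() {
        // Toy corpus; a real one would need to be far larger.
        string[] corpus = {
            "the fish jumped as the tornado hit the weather station",
            "bad weather brought snow and a tornado warning",
            "he wore a hat and trousers to the fashion show",
            "the ball went out of play during the sport match",
        };
        var vectors = BuildVectors(corpus);
        string[] words = { "fish", "ball", "hat", "trousers", "tornado" };
        string[] categories = { "sport", "fashion", "weather" };
        foreach (string word in words) {
            string best = categories.OrderByDescending(
                c => Cosine(vectors[word], vectors[c])).First();
            Console.WriteLine("{0}: {1}", best, word);
        }
    }
}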
A different approach would be naive Bayes text categorization. Sample texts with assigned categories are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a given category; see Bayesian spam filtering. I don't know how well that works with single words.
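A rough sketch of that learning step, assuming you already have labelled sample texts - the three training strings below are invented, and the class priors are taken as uniform since there is one training text per category:

using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayesDemo {
    // per-category word counts, per-category total word count, global vocabulary
    static Dictionary<string, Dictionary<string, int>> wordCounts = new Dictionary<string, Dictionary<string, int>>();
    static Dictionary<string, int> totals = new Dictionary<string, int>();
    static HashSet<string> vocabulary = new HashSet<string>();

    static void Train(string category, string text) {
        if (!wordCounts.ContainsKey(category)) {
            wordCounts[category] = new Dictionary<string, int>();
            totals[category] = 0;
        }
        foreach (string w in text.ToLowerInvariant().Split(' ')) {
            var counts = wordCounts[category];
            counts[w] = counts.TryGetValue(w, out int n) ? n + 1 : 1;
            totals[category]++;
            vocabulary.Add(w);
        }
    }

    // log P(word | category) with Laplace (add-one) smoothing, so unseen
    // words don't zero everything out
    static double LogLikelihood(string word, string category) {
        wordCounts[category].TryGetValue(word, out int count);
        return Math.Log((count + 1.0) / (totals[category] + vocabulary.Count));
    }

    static string Classify(string word) {
        // uniform priors assumed, so the likelihood alone decides
        return totals.Keys.OrderByDescending(c => LogLikelihood(word, c)).First();
    }

    static void Main() {
        Train("weather", "tornado warning heavy snow and rain tornado");
        Train("sport", "the ball was kicked and the match was a great sport");
        Train("fashion", "hat and trousers on the catwalk fashion show");
        foreach (string w in new[] { "tornado", "ball", "trousers" })
            Console.WriteLine("{0}: {1}", Classify(w), w);
    }
}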
Yeah, I'd go for the WordNet approach. Check this tutorial on WordNet-based semantic similarity measurement. You can query WordNet online at princeton.edu (google it), so it should be relatively easy to code a solution to your problem - see the sketch below. Hope this helps,
X.
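To show the path-length idea those similarity measures are built on, here's a toy version over a hand-coded "is-a" fragment. The dictionary below is invented purely for illustration; in a real solution the hypernym links would come from WordNet:

using System;
using System.Collections.Generic;
using System.Linq;

class PathSimilarityDemo {
    // Tiny invented hypernym ("is-a") fragment; WordNet supplies the real one.
    static Dictionary<string, string> hypernym = new Dictionary<string, string> {
        { "tornado", "weather" }, { "snowball", "weather" },
        { "ball", "sport" },
        { "hat", "fashion" }, { "trousers", "fashion" },
        { "weather", "entity" }, { "sport", "entity" }, { "fashion", "entity" },
    };

    // Walk the "is-a" links from a word up to the root.
    static List<string> PathToRoot(string word) {
        var path = new List<string> { word };
        while (hypernym.TryGetValue(word, out string parent)) {
            path.Add(parent);
            word = parent;
        }
        return path;
    }

    // Path-length similarity: 1 / (1 + number of edges on the shortest
    // path through the lowest common ancestor).
    static double Similarity(string a, string b) {
        var pa = PathToRoot(a);
        var pb = PathToRoot(b);
        foreach (string node in pa)
            if (pb.Contains(node))
                return 1.0 / (1 + pa.IndexOf(node) + pb.IndexOf(node));
        return 0;
    }

    static void Main() {
        string[] words = { "tornado", "ball", "hat", "trousers", "snowball" };
        string[] categories = { "sport", "fashion", "weather" };
        foreach (string word in words) {
            string best = categories.OrderByDescending(
                c => Similarity(word, c)).First();
            Console.WriteLine("{0}: {1}", best, word);
        }
    }
}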
Really poor answer (it demonstrates no "understanding") - but as a crazy stab you could hit Google (through code) with queries like "+Fishing +Sport", "+Fishing +Cooking", etc. (i.e. cross-join each word and category) - and let the Google fight decide! The combination with the most "hits" gets chosen...
For example (results first):
weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado
With code (TODO: add threading ;-p):
using System;
using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

static void Main() {
    string[] words = { "fish", "ball", "hat", "trousers", "snowball", "tornado" };
    string[] categories = { "sport", "fashion", "weather" };
    using (WebClient client = new WebClient()) {
        foreach (string word in words) {
            // pick the category whose "+word +category" query gets the most hits
            var bestCategory = categories.OrderByDescending(
                cat => Rank(client, word, cat)).First();
            Console.WriteLine("{0}: {1}", bestCategory, word);
        }
    }
}

// Scrapes the result count out of the Google results page; note this
// depends on the current markup of that page, so it will break whenever
// Google changes it.
static int Rank(WebClient client, string word, string category) {
    string s = client.DownloadString("http://www.google.com/search?q=%2B" +
        Uri.EscapeDataString(word) + "+%2B" +
        Uri.EscapeDataString(category));
    var match = Regex.Match(s, @"of about <b>([0-9,]+)</b>");
    int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
    Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
    return rank;
}
My naive approach:
You could write a custom algorithm to work specifically on that data; for instance, words ending in 'ing' are verbs (present participles) and could well be sports.
Create a set of categorization rules like the one above and see how high an accuracy you get.
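For example, the rules could just be a list of predicate/category pairs tried in order - the rules below are placeholders you'd tune against the real word list:

using System;
using System.Collections.Generic;

class RuleDemo {
    static void Main() {
        // Each rule pairs a test on the word with the category to assign.
        var rules = new List<(Func<string, bool> Test, string Category)> {
            (w => w.EndsWith("ing"), "sport"),    // fishing, boxing...
            (w => w.EndsWith("ers"), "fashion"),  // trousers, slippers...
            (w => w.Contains("snow") || w.Contains("storm"), "weather"),
        };
        foreach (string word in new[] { "fishing", "trousers", "snowball", "hat" }) {
            string category = "unknown";
            foreach (var rule in rules)
                if (rule.Test(word)) { category = rule.Category; break; }
            // "unknown" results show you where the rule set needs extending
            Console.WriteLine("{0}: {1}", category, word);
        }
    }
}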
EDIT:
Steal the Wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' for that word (e.g. fishing) wins.
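Assuming you've already pulled the article text per category out of the dump, the counting step might look like this - the mini 'articles' here are stand-ins for the real ones:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class DensityDemo {
    static void Main() {
        // articlesByCategory would come from the Wikipedia dump;
        // these snippets are stand-ins.
        var articlesByCategory = new Dictionary<string, string[]> {
            { "weather", new[] { "a tornado is a violent rotating storm", "snow fell all week" } },
            { "sport",   new[] { "the ball crossed the line", "the match ended in a draw" } },
            { "fashion", new[] { "trousers and a hat completed the outfit" } },
        };
        string[] words = { "tornado", "ball", "trousers" };
        foreach (string word in words) {
            // keyword density = occurrences of the word per word of article text
            string best = articlesByCategory.OrderByDescending(kv => {
                int hits = kv.Value.Sum(a =>
                    Regex.Matches(a, @"\b" + Regex.Escape(word) + @"\b").Count);
                int total = kv.Value.Sum(a => a.Split(' ').Length);
                return (double)hits / total;
            }).First().Key;
            Console.WriteLine("{0}: {1}", best, word);
        }
    }
}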