Categorizing Words and Category Values

温柔的废话 2021-01-31 05:49

We were set an algorithm problem in class today, framed as "if you figure out a solution, you don't have to do this subject". So of course, we all thought we would give it a go.

21 answers
  • 2021-01-31 06:35

    Google is forbidden, but they have an almost perfect solution: Google Sets.

    Because you need to understand the semantics of the words, you need external data sources. You could try WordNet. Or you could try Wikipedia: find the page for every word (or perhaps only for the categories) and look for other words that appear on that page or on linked pages.

  • 2021-01-31 06:36

    First of all, you need sample text to analyze in order to learn the relationships between words. Categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.

    A different approach would be naive Bayes text categorization. It needs sample texts that are each labelled with a category. In a learning step, the program learns the categories and the likelihood that a word occurs in a text assigned to each category; see Bayesian spam filtering. I don't know how well that works with single words.
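
    To make the learning step concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. The class and method names (NaiveBayes, Train, Classify) are made up for illustration, tokenization is a plain split on spaces, and the training texts are whatever labelled samples you can get hold of.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical helper: multinomial naive Bayes over space-separated tokens.
    class NaiveBayes {
        // category -> (word -> occurrences of that word in the category's training text)
        readonly Dictionary<string, Dictionary<string, int>> counts =
            new Dictionary<string, Dictionary<string, int>>();
        // category -> number of training texts seen for it
        readonly Dictionary<string, int> docs = new Dictionary<string, int>();
        readonly HashSet<string> vocab = new HashSet<string>();

        static string[] Tokens(string text) =>
            text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        public void Train(string category, string text) {
            if (!counts.ContainsKey(category)) {
                counts[category] = new Dictionary<string, int>();
                docs[category] = 0;
            }
            docs[category]++;
            foreach (string w in Tokens(text)) {
                counts[category].TryGetValue(w, out int n);
                counts[category][w] = n + 1;
                vocab.Add(w);
            }
        }

        // log P(category) + sum of log P(word | category), with add-one smoothing
        public double LogScore(string category, string text) {
            Dictionary<string, int> wc = counts[category];
            double total = wc.Values.Sum();
            double score = Math.Log((double)docs[category] / docs.Values.Sum());
            foreach (string w in Tokens(text)) {
                wc.TryGetValue(w, out int n);
                score += Math.Log((n + 1) / (total + vocab.Count));
            }
            return score;
        }

        // The trained category with the highest score.
        public string Classify(string text) =>
            counts.Keys.OrderByDescending(c => LogScore(c, text)).First();
    }
    ```

    Call Train once per labelled sample text, then Classify with the word you want to place. For a single-word input the score is just the category prior times one smoothed word likelihood, so the result depends heavily on how much training text each category has.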

  • 2021-01-31 06:37

    Yeah, I'd go for the WordNet approach. Check this tutorial on WordNet-based semantic similarity measurement. You can query WordNet online at princeton.edu (google it), so it should be relatively easy to code a solution for your problem. Hope this helps,

    X.

  • 2021-01-31 06:39

    Really poor answer (demonstrates no "understanding"), but as a crazy stab you could hit Google (through code) for, say, "+Fishing +Sport", "+Fishing +Cooking" etc. (i.e. cross join each word and category) and let the Google fight decide: the combination with the most "hits" gets chosen.

    For example (results first):

    weather: fish
    sport: ball
    weather: hat
    fashion: trousers
    weather: snowball
    weather: tornado
    

    With code (TODO: add threading ;-p):

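    using System;
    using System.Diagnostics;
    using System.Globalization;
    using System.Linq;
    using System.Net;
    using System.Text.RegularExpressions;

    static class Program {
        static void Main() {
            string[] words = { "fish", "ball", "hat", "trousers", "snowball", "tornado" };
            string[] categories = { "sport", "fashion", "weather" };

            using (WebClient client = new WebClient()) {
                foreach (string word in words) {
                    // Pick the category whose combined search reports the most hits.
                    var bestCategory = categories.OrderByDescending(
                        cat => Rank(client, word, cat)).First();
                    Console.WriteLine("{0}: {1}", bestCategory, word);
                }
            }
        }

        // Hit count for the query "+word +category", or 0 if no count is found in the page.
        static long Rank(WebClient client, string word, string category) {
            string s = client.DownloadString("http://www.google.com/search?q=%2B" +
                Uri.EscapeDataString(word) + "+%2B" +
                Uri.EscapeDataString(category));
            // Scrapes the "of about <b>N</b>" text from the results page; breaks if that markup changes.
            var match = Regex.Match(s, @"of about <b>([0-9,]+)</b>");
            // long + invariant culture: counts can exceed int.MaxValue and use ',' as the group separator.
            long rank = match.Success
                ? long.Parse(match.Groups[1].Value, NumberStyles.AllowThousands, CultureInfo.InvariantCulture)
                : 0;
            Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
            return rank;
        }
    }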
    
  • 2021-01-31 06:39

    My naive approach:

    1. Create a huge text file like this (read the article for inspiration)
    2. For every word, scan the text and, whenever you match that word, count the 'categories' that appear within N positions (the maximum distance, or radius) to its left and right.
    3. The word most likely belongs to the category with the highest count.
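
    The steps above could be sketched roughly like this. WindowCounter and BestCategory are names invented for the example, the tokenizer is a crude split on whitespace and a few punctuation marks, and the word and category names are assumed to be lowercase.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class WindowCounter {
        // Returns the category name seen most often within `radius` tokens of `word`,
        // or null if no category name ever appears near it.
        // Assumes `word` and `categories` are lowercase and `categories` is non-empty.
        public static string BestCategory(string corpus, string word, string[] categories, int radius) {
            string[] tokens = corpus.ToLowerInvariant().Split(
                new[] { ' ', '\n', '\r', '\t', '.', ',', ';', '!', '?' },
                StringSplitOptions.RemoveEmptyEntries);

            Dictionary<string, int> counts = categories.ToDictionary(c => c, c => 0);

            for (int i = 0; i < tokens.Length; i++) {
                if (tokens[i] != word) continue;
                // Look at up to `radius` tokens on each side of the match.
                int from = Math.Max(0, i - radius);
                int to = Math.Min(tokens.Length - 1, i + radius);
                for (int j = from; j <= to; j++) {
                    if (j != i && counts.ContainsKey(tokens[j])) counts[tokens[j]]++;
                }
            }

            KeyValuePair<string, int> best = counts.OrderByDescending(kv => kv.Value).First();
            return best.Value > 0 ? best.Key : null;
        }
    }
    ```

    For example, BestCategory(text, "fishing", new[] { "sport", "weather" }, 4) counts how often "sport" and "weather" show up within four tokens of each "fishing". A bigger radius catches more co-occurrences but also more noise.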
  • 2021-01-31 06:40

    You could write a custom algorithm that works specifically on that data; for instance, words ending in 'ing' are often verbs (present participles) and could be sports.

    Create a set of categorization rules like the one above and see how high an accuracy you get.
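
    A rough sketch of what such a rule set and the accuracy check might look like. Only the 'ing' rule comes from the text above; the second rule, the RuleClassifier name and the hand-labelled words you would pass to Accuracy are made up for illustration.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class RuleClassifier {
        // Ordered suffix rules; the first one that matches wins.
        // "ing" -> sport is the rule from the answer; "ers" -> fashion is a made-up extra.
        static readonly (string Suffix, string Category)[] Rules = {
            ("ing", "sport"),
            ("ers", "fashion"),
        };

        // Category of the first matching rule, or null if no rule fires.
        public static string Classify(string word) {
            foreach (var rule in Rules) {
                if (word.EndsWith(rule.Suffix, StringComparison.Ordinal)) return rule.Category;
            }
            return null;
        }

        // Fraction of hand-labelled words (word -> expected category) the rules get right.
        public static double Accuracy(IDictionary<string, string> labelled) =>
            labelled.Count(kv => Classify(kv.Key) == kv.Value) / (double)labelled.Count;
    }
    ```

    Feed Accuracy a dictionary of words you have labelled by hand. A word like "snowing" shows the weakness straight away: the 'ing' rule files it under sport.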

    EDIT:

    Steal the Wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category; the category with the highest 'keyword density' for that word (e.g. fishing) wins.
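
    As a sketch of the counting part, assuming you have already pulled the article texts for each category out of the dump into memory (that extraction is not shown, and KeywordDensity, Density and BestCategory are invented names). Density here means occurrences of the word divided by the total number of tokens in the category's articles.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    static class KeywordDensity {
        // Lowercase the text and split on anything that is not a-z.
        static string[] Tokens(string text) =>
            Regex.Split(text.ToLowerInvariant(), "[^a-z]+").Where(t => t.Length > 0).ToArray();

        // Occurrences of `word` divided by the total token count of the given articles.
        public static double Density(string word, IEnumerable<string> articles) {
            long hits = 0, total = 0;
            foreach (string article in articles) {
                string[] tokens = Tokens(article);
                total += tokens.Length;
                hits += tokens.Count(t => t == word);
            }
            return total == 0 ? 0.0 : (double)hits / total;
        }

        // Category whose articles have the highest density of `word` (expects lowercase `word`).
        public static string BestCategory(string word, IDictionary<string, List<string>> articlesByCategory) =>
            articlesByCategory.OrderByDescending(kv => Density(word, kv.Value)).First().Key;
    }
    ```

    Dividing by the total token count keeps a category with many long articles from winning just because it has more text.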
