Gender identification in natural language processing

暖寄归人 2021-02-11 03:04

I have written the code below using the Stanford NLP packages.

GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);
         


        
4 answers
  •  余生分开走
    2021-02-11 03:29

    There are many approaches; one of them is outlined in the NLTK cookbook.

    Basically, you build a classifier that extracts some features from a name (first letter, last letter, first two letters, last two letters, and so on) and makes a prediction based on these features.

    import random

    import nltk

    # Requires the NLTK names corpus; run nltk.download('names') once if it is missing.

    def extract_features(name):
        # Describe a name by its leading and trailing letters.
        name = name.lower()
        return {
            'last_char': name[-1],
            'last_two': name[-2:],
            'last_three': name[-3:],
            'first': name[0],
            'first2': name[:2]  # first two letters
        }

    f_names = nltk.corpus.names.words('female.txt')
    m_names = nltk.corpus.names.words('male.txt')

    # Label every name, then shuffle so both genders are mixed in each split.
    all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
    random.shuffle(all_names)

    # Train on the first 500 shuffled names; evaluate on the remainder.
    test_set = all_names[500:]
    train_set = all_names[:500]

    test_set_feat = [(extract_features(n), g) for n, g in test_set]
    train_set_feat = [(extract_features(n), g) for n, g in train_set]

    classifier = nltk.NaiveBayesClassifier.train(train_set_feat)

    print(nltk.classify.accuracy(classifier, test_set_feat))
    

    This basic test gives approximately 77% accuracy.
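
    To make it clearer how a prediction follows from such features, here is a minimal pure-Python sketch of a naive Bayes classifier over letter features. It is only an illustration: the names `train`, `predict`, and `toy_names` are hypothetical, the eight-name data set is made up and far too small for real use, and the add-one smoothing is a simplification that is not necessarily what NLTK does internally.

```python
import math
from collections import Counter, defaultdict


def extract_features(name):
    """Describe a name by a few leading and trailing letters."""
    name = name.lower()
    return {
        'last_char': name[-1],
        'last_two': name[-2:],
        'first': name[0],
    }


def train(labeled_names):
    """Count labels and (feature, value) pairs per label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (feature, value)
    vocab = defaultdict(set)             # feature -> values seen in training
    for name, label in labeled_names:
        label_counts[label] += 1
        for feat, val in extract_features(name).items():
            value_counts[label][(feat, val)] += 1
            vocab[feat].add(val)
    return label_counts, value_counts, vocab


def predict(model, name):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for feat, val in extract_features(name).items():
            count = value_counts[label][(feat, val)]
            # Add-one smoothing, with one extra slot for unseen values.
            score += math.log((count + 1) / (n + len(vocab[feat]) + 1))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, only to show the mechanics.
toy_names = [
    ('Anna', 'f'), ('Maria', 'f'), ('Julia', 'f'), ('Sophia', 'f'),
    ('John', 'm'), ('Peter', 'm'), ('Mark', 'm'), ('Robert', 'm'),
]
model = train(toy_names)
print(predict(model, 'Olivia'))
print(predict(model, 'Frank'))
```

    In this toy data the trailing letters carry most of the signal, because neither test name shares a first letter with any training name. With a real corpus, the same counting idea is what lets the classifier pick up on common name endings.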
