Gender identification in natural language processing

被撕碎了的回忆 2021-02-11 02:35

I have written the code below using the Stanford NLP packages.

GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);


        
2 Answers
  • 2021-02-11 03:16

    If your named entity recognizer outputs PERSON for a token, you might use (or build if you don't have one) a gender classifier based on first names. As an example, see the Gender Identification section from the NLTK library tutorial pages. They use the following features:

    • Last letter of name.
    • First letter of name.
    • Length of name (number of characters).
    • Character unigram presence (boolean whether a character is in the name).

    That said, I have a hunch that character n-gram frequencies, possibly up to character trigrams, will give you pretty good results.
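
    The features listed above, plus the n-gram idea, could be sketched roughly as follows. This is only an illustration: the function name `name_features`, the feature key names, the `^`/`$` boundary markers, and the `max_n` parameter are assumptions of this sketch, not taken from the NLTK tutorial. It assumes a non-empty name and only checks the letters a to z for unigram presence.

```python
from collections import Counter


def name_features(name, max_n=3):
    """Map a non-empty first name to a feature dict (hypothetical helper)."""
    name = name.lower()
    features = {
        "last_letter": name[-1],
        "first_letter": name[0],
        "length": len(name),
    }
    # Character unigram presence: True if the letter occurs in the name.
    for ch in "abcdefghijklmnopqrstuvwxyz":
        features["has(%s)" % ch] = ch in name
    # Character n-gram counts for n = 2..max_n.
    # "^" and "$" mark the start and end of the name (an assumption of this sketch).
    padded = "^" + name + "$"
    for n in range(2, max_n + 1):
        grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
        for gram, count in grams.items():
            features["ngram(%s)" % gram] = count
    return features


example = name_features("Lulu")
```

    A dict of this shape can be paired with a label and passed to any classifier that accepts feature dicts, so it is easy to compare accuracy with and without the n-gram features.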

  • 2021-02-11 03:22

    There are many approaches; one of them is outlined in the NLTK cookbook.

    Basically, you build a classifier that extracts some features from a name (first letter, last letter, first two letters, last two letters, and so on) and makes a prediction based on those features.

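    import nltk
    import random
    
    # Requires the NLTK names corpus: nltk.download('names')
    
    def extract_features(name):
        """Return prefix and suffix features for a single name."""
        name = name.lower()
        return {
            'last_char': name[-1],
            'last_two': name[-2:],
            'last_three': name[-3:],
            'first': name[0],
            'first2': name[:2]  # first two letters (name[:1] would repeat 'first')
        }
    
    f_names = nltk.corpus.names.words('female.txt')
    m_names = nltk.corpus.names.words('male.txt')
    
    # Label every name, then shuffle so both splits mix male and female names.
    all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
    random.shuffle(all_names)
    
    # Train on the first 500 shuffled names and evaluate on the rest.
    test_set = all_names[500:]
    train_set = all_names[:500]
    
    test_set_feat = [(extract_features(n), g) for n, g in test_set]
    train_set_feat = [(extract_features(n), g) for n, g in train_set]
    
    classifier = nltk.NaiveBayesClassifier.train(train_set_feat)
    
    # print() as a function works on both Python 2 and Python 3.
    print(nltk.classify.accuracy(classifier, test_set_feat))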
    

    This basic test gives you roughly 77% accuracy.
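
    To tie this back to the question, a name-based classifier would normally be applied only to tokens that a named entity recognizer has tagged PERSON. The sketch below shows that glue step under some assumptions: `genders_for_persons` and `last_letter_rule` are hypothetical names, the input is assumed to be a list of (token, NER tag) pairs, and `last_letter_rule` is a toy stand-in for a trained model, not a real classifier.

```python
def genders_for_persons(tagged_tokens, classify):
    """Return (token, gender) pairs for tokens whose NER tag is PERSON.

    tagged_tokens: iterable of (token, ner_tag) pairs (assumed input format).
    classify: any callable that maps a name string to a gender label.
    """
    results = []
    for token, ner_tag in tagged_tokens:
        if ner_tag == "PERSON":
            results.append((token, classify(token)))
    return results


def last_letter_rule(name):
    # Toy stand-in for a trained classifier; it only looks at the last letter.
    return "f" if name.lower().endswith(("a", "e", "i")) else "m"


tagged = [
    ("Anna", "PERSON"),
    ("met", "O"),
    ("John", "PERSON"),
    ("in", "O"),
    ("Paris", "LOCATION"),
]
pairs = genders_for_persons(tagged, last_letter_rule)
```

    With the NLTK classifier trained above, the stand-in could be swapped for `lambda n: classifier.classify(extract_features(n))`.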
