Gender identification in natural language processing

暖寄归人 2021-02-11 03:04

I have written the code below using the Stanford NLP packages.

GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);


4 answers
  • 2021-02-11 03:14

    Though the previous answer by @Sebastian Schuster is somewhat close to what is expected, it does not appear to work with current versions of Stanford NLP.

    An updated, working version of that code segment is below.

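    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    // Configure a pipeline that ends with the gender annotator.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation document = new Annotation("Annie goes to school");

    pipeline.annotate(document);

    // Print each token with the gender label stored on it (null if none was assigned).
    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.print(token.value());
        System.out.print(", Gender: ");
        System.out.println(token.get(CoreAnnotations.GenderAnnotation.class));
      }
    }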
    
  • 2021-02-11 03:19

    The gender annotator doesn't add its information to the text output, but you can still access it through code, as shown in the following snippet:

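    import java.util.Properties;

    import edu.stanford.nlp.ie.machinereading.structure.MachineReadingAnnotations;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    // Configure a pipeline that ends with the gender annotator.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation document = new Annotation("Annie goes to school");

    pipeline.annotate(document);

    // Print each token with the gender label stored on it (null if none was assigned).
    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.print(token.value());
        System.out.print(", Gender: ");
        System.out.println(token.get(MachineReadingAnnotations.GenderAnnotation.class));
      }
    }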
    

    Output:

    Annie, Gender: FEMALE
    goes, Gender: null
    to, Gender: null
    school, Gender: null
    
  • 2021-02-11 03:29

    There are many approaches; one of them is outlined in the NLTK cookbook.

    Basically, you build a classifier that extracts some features (first letter, last letter, first two letters, last two letters, and so on) from a name and makes a prediction based on those features.

    import nltk
    import random

    # The names corpus must be available locally, e.g. via nltk.download('names').

    def extract_features(name):
        """Return prefix and suffix features for a single name."""
        name = name.lower()
        return {
            'last_char': name[-1],
            'last_two': name[-2:],
            'last_three': name[-3:],
            'first': name[0],
            'first2': name[:2]
        }

    f_names = nltk.corpus.names.words('female.txt')
    m_names = nltk.corpus.names.words('male.txt')

    # Label every name, then shuffle so both genders are mixed in each split.
    all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
    random.shuffle(all_names)

    # Train on the first 500 shuffled names and test on the rest.
    test_set = all_names[500:]
    train_set = all_names[:500]

    test_set_feat = [(extract_features(n), g) for n, g in test_set]
    train_set_feat = [(extract_features(n), g) for n, g in train_set]

    classifier = nltk.NaiveBayesClassifier.train(train_set_feat)

    print(nltk.classify.accuracy(classifier, test_set_feat))
    

    This basic test gives an accuracy of approximately 77%.

  • 2021-02-11 03:36

    If your named entity recognizer outputs PERSON for a token, you might use (or build if you don't have one) a gender classifier based on first names. As an example, see the Gender Identification section from the NLTK library tutorial pages. They use the following features:

    • Last letter of name.
    • First letter of name.
    • Length of name (number of characters).
    • Character unigram presence (boolean whether a character is in the name).
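    The four feature types listed above can be sketched as a single feature function. This is only an illustration: the function name `name_features` and the feature keys are my own choices, not the tutorial's code, and it assumes a non-empty name made of ASCII letters.

```python
import string


def name_features(name):
    """Build a feature dict from the four feature types listed above.

    Hypothetical helper: the name and the feature keys are illustrative.
    Assumes `name` is a non-empty string.
    """
    name = name.lower()
    features = {
        'first_letter': name[0],
        'last_letter': name[-1],
        'length': len(name),
    }
    # Character unigram presence: one boolean per letter of the alphabet.
    for letter in string.ascii_lowercase:
        features['has({})'.format(letter)] = letter in name
    return features


features = name_features('Annie')
print(features['first_letter'], features['last_letter'], features['length'])  # a e 5
print(features['has(n)'], features['has(z)'])  # True False
```

    A dict like this can be paired with a gender label and passed to a classifier's training routine in the same way as any other feature dict.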

    That said, I have a hunch that using character n-gram frequencies, possibly up to character trigrams, will give you pretty good results.
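    As a rough sketch of that idea, character n-gram counts up to trigrams could be extracted like this. The function name `char_ngram_features`, the `max_n` parameter, and the `^`/`$` boundary markers are assumptions made for illustration; I have not measured how well these features perform.

```python
from collections import Counter


def char_ngram_features(name, max_n=3):
    """Count character n-grams of length 1..max_n in a name.

    Hypothetical helper: the name, the keys, and the padding are illustrative.
    """
    # '^' and '$' mark the start and end of the name so that prefixes and
    # suffixes produce their own n-grams (an assumption, not from the answer).
    padded = '^' + name.lower() + '$'
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts['ngram({})'.format(padded[i:i + n])] += 1
    return dict(counts)


features = char_ngram_features('Anna')
print(features['ngram(n)'])    # 2
print(features['ngram(na$)'])  # 1
```

    Note that some classifiers treat feature values as categories rather than numbers, so it may be worth binning the counts or using presence flags instead.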
