Gender identification in natural language processing


I have written the code below using the Stanford NLP packages.

GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);


                      


      
      
        
4 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  感动是毒        
                
              
                            
                2021-02-11 03:14
              
            
            
                                                                       
Though the previous answer by @Sebastian Schuster is somewhat close to what is expected, it does not appear to work with current versions of Stanford NLP.

An updated, working version of that code segment is shown below.

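import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Build a pipeline whose last stage is the gender annotator.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation("Annie goes to school");

pipeline.annotate(document);

// Print every token with its gender annotation (null when none was assigned).
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
  for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.print(token.value());
    System.out.print(", Gender: ");
    System.out.println(token.get(CoreAnnotations.GenderAnnotation.class));
  }
}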

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  借酒劲吻你        
                
              
                            
                2021-02-11 03:19
              
            
            
                                                                       
The gender annotator doesn't add the information to the text output but you can still access it through code as shown in the following snippet:

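import java.util.Properties;

import edu.stanford.nlp.ie.machinereading.structure.MachineReadingAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Build a pipeline whose last stage is the gender annotator.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation("Annie goes to school");

pipeline.annotate(document);

// Print every token with its gender annotation (null when none was assigned).
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
  for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.print(token.value());
    System.out.print(", Gender: ");
    System.out.println(token.get(MachineReadingAnnotations.GenderAnnotation.class));
  }
}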


Output:

Annie, Gender: FEMALE
goes, Gender: null
to, Gender: null
school, Gender: null

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  余生分开走        
                
              
                            
                2021-02-11 03:29
              
            
            
                                                                       
There are many approaches; one of them is outlined in the NLTK cookbook.

Basically, you build a classifier that extracts some features from a name (first letter, last letter, first two letters, last two letters and so on) and makes a prediction based on these features.

import random

import nltk

# Requires the NLTK names corpus: nltk.download('names')

def extract_features(name):
    name = name.lower()
    return {
        'last_char': name[-1],
        'last_two': name[-2:],
        'last_three': name[-3:],
        'first': name[0],
        'first2': name[:2],  # first two letters
    }

f_names = nltk.corpus.names.words('female.txt')
m_names = nltk.corpus.names.words('male.txt')

all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
random.shuffle(all_names)

# Train on the first 500 shuffled names and test on the rest.
train_set = all_names[:500]
test_set = all_names[500:]

train_set_feat = [(extract_features(n), g) for n, g in train_set]
test_set_feat = [(extract_features(n), g) for n, g in test_set]

classifier = nltk.NaiveBayesClassifier.train(train_set_feat)

print(nltk.classify.accuracy(classifier, test_set_feat))


This basic test gives approximately 77% accuracy.
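
To make the "prediction based on these features" step more concrete, here is a rough pure-Python sketch of a Naive Bayes classifier over feature dictionaries. It is not NLTK's implementation (NLTK uses a different smoothing scheme), and the names `train_naive_bayes`, `classify` and `last_letter`, as well as the six hand-picked toy names, are made up for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_features):
    """Count labels and per-label feature values from (features, label) pairs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> counts of values
    vocab = defaultdict(set)             # feature -> set of values seen in training
    for features, label in labelled_features:
        label_counts[label] += 1
        for feat, value in features.items():
            value_counts[(label, feat)][value] += 1
            vocab[feat].add(value)
    return label_counts, value_counts, vocab

def classify(model, features):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for feat, value in features.items():
            seen = value_counts[(label, feat)][value]
            # Add-one smoothing; the extra +1 reserves mass for unseen values.
            score += math.log((seen + 1) / (count + len(vocab[feat]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def last_letter(name):
    # Hypothetical single-feature extractor, just for this toy example.
    return {"last_char": name[-1].lower()}

# Hand-made toy data, NOT the NLTK names corpus.
toy = [("Anna", "f"), ("Maria", "f"), ("Julia", "f"),
       ("John", "m"), ("Peter", "m"), ("Mark", "m")]
model = train_naive_bayes([(last_letter(n), g) for n, g in toy])
print(classify(model, last_letter("Sofia")))
```

With real data you would plug in the full feature extractor and the shuffled corpus instead of the toy list; the counting and scoring logic stays the same.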
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  陌清茗        
                
              
                            
                2021-02-11 03:36
              
            
            
                                                                       
If your named entity recognizer outputs PERSON for a token, you might use (or build if you don't have one) a gender classifier based on first names. As an example, see the Gender Identification section from the NLTK library tutorial pages. They use the following features:


Last letter of name.
First letter of name.
Length of name (number of characters).
Character unigram presence (boolean whether a character is in the name).
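
The four features listed above could be extracted roughly like this. This is only a sketch: the function name `name_features` and the feature keys are my own choices, and it assumes names written in plain ASCII letters.

```python
import string

def name_features(name):
    """Build a feature dict from a first name (assumes a non-empty ASCII name)."""
    name = name.lower()
    features = {
        "last_letter": name[-1],
        "first_letter": name[0],
        "length": len(name),
    }
    # Character unigram presence: one boolean per letter of the alphabet.
    for ch in string.ascii_lowercase:
        features["has(%s)" % ch] = ch in name
    return features

feats = name_features("Annie")
print(feats["last_letter"], feats["first_letter"], feats["length"], feats["has(n)"])
```

The resulting dictionary can be fed to any classifier that accepts feature dicts, such as NLTK's `NaiveBayesClassifier`.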


That said, I have a hunch that using character n-gram frequencies (possibly up to character trigrams) will give you pretty good results.
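
A minimal sketch of what such n-gram frequency features might look like is below. The function name `char_ngram_counts` and the `^` / `$` boundary markers are assumptions of mine, added so that n-grams at the start and end of a name can be told apart from those in the middle.

```python
from collections import Counter

def char_ngram_counts(name, max_n=3):
    """Count all character n-grams of length 1..max_n in a name."""
    # '^' and '$' mark the start and end of the name (an illustrative choice).
    padded = "^" + name.lower() + "$"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

counts = char_ngram_counts("Anna")
print(counts["a"], counts["nn"], counts["na$"])
```

These counts can be used directly as feature values, or normalised by the name length if you prefer relative frequencies.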
                                                                        
                                                        
            
            
              
                