I have written the code below using the Stanford NLP packages.
GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);
Although the previous answer by @Sebastian Schuster is close to what is expected, it does not appear to work with current versions of Stanford NLP.
An updated, working version of that code segment is below.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("Annie goes to school");
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.print(token.value());
        System.out.print(", Gender: ");
        System.out.println(token.get(CoreAnnotations.GenderAnnotation.class));
    }
}
The gender annotator doesn't add the information to the text output, but you can still access it through code, as shown in the following snippet:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("Annie goes to school");
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.print(token.value());
        System.out.print(", Gender: ");
        System.out.println(token.get(MachineReadingAnnotations.GenderAnnotation.class));
    }
}
Output:
Annie, Gender: FEMALE
goes, Gender: null
to, Gender: null
school, Gender: null
There are a lot of approaches, and one of them is outlined in the NLTK cookbook.
Basically, you build a classifier that extracts some features from a name (first letter, last letter, first two letters, last two letters and so on) and makes a prediction based on these features.
import nltk
import random

def extract_features(name):
    # Prefixes and suffixes of the lowercased name
    name = name.lower()
    return {
        'last_char': name[-1],
        'last_two': name[-2:],
        'last_three': name[-3:],
        'first': name[0],
        'first2': name[:2]
    }

# Requires the NLTK "names" corpus
f_names = nltk.corpus.names.words('female.txt')
m_names = nltk.corpus.names.words('male.txt')

all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]
random.shuffle(all_names)

# Train on the first 500 names, test on the rest
test_set = all_names[500:]
train_set = all_names[:500]

test_set_feat = [(extract_features(n), g) for n, g in test_set]
train_set_feat = [(extract_features(n), g) for n, g in train_set]

classifier = nltk.NaiveBayesClassifier.train(train_set_feat)
print(nltk.classify.accuracy(classifier, test_set_feat))
This basic test gives you approximately 77% accuracy.
If your named entity recognizer outputs PERSON for a token, you might use (or build, if you don't have one) a gender classifier based on first names. As an example, see the Gender Identification section of the NLTK library tutorial pages. They use the following features:
That said, I have a hunch that using character n-gram frequencies (possibly up to character trigrams) will give you pretty good results.
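To make the n-gram idea concrete, here is a minimal sketch of a feature extractor that counts character n-grams up to trigrams. The function name `char_ngram_features` and the `^`/`$` boundary markers are my own choices for illustration, not something taken from NLTK, and I have not measured how well these features perform.

```python
from collections import Counter

def char_ngram_features(name, max_n=3):
    """Count character n-grams (n = 1..max_n) in a name.

    '^' and '$' are hypothetical boundary markers so that prefixes and
    suffixes can be told apart from n-grams in the middle of the name.
    """
    padded = "^" + name.lower() + "$"
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the padded name
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return dict(counts)

features = char_ngram_features("Annie")
```

The result is a plain dict of n-gram counts, so it should be usable as a feature set for a classifier that accepts feature dictionaries, but treat that as an assumption to verify on your own data.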