I am using TokensRegex for rule based entity extraction. It works well but I am having trouble getting my output in the desired format. The following snippet of code gives me an
Answering my own question for those struggling with a similar issue. THe key to getting your output in the correct format lies in how you define your rules in the rules file. Here's what I changed in the rules to change the output:
Old Rule:
{ ruleType: "tokens",
pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
result: Annotate($1, ner, "LOCATION"),
}
New Rule
{ ruleType: "tokens",
pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
action: Annotate($1, ner, "LOCATION"),
result: "LOCATION"
}
How you define your result field defines the output format of your data.
Hope this helps!
I managed to get output in desired format.
Annotation document = new Annotation(<Sentence to annotate>);
//use the pipeline to annotate the document we created
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
//Note- I doesn't put environment related stuff in rule file.
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor
.createExtractorFromFiles(env, "test_degree.rules");
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
List<MatchedExpression> matched = extractor.extractExpressions(sentence);
for(MatchedExpression phrase : matched){
// Print out matched text and value
System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
}
}
Output:
MATCHED ENTITY: Technical Skill VALUE: SKILL
You might want to have a look at my rule file in this question.
Hope this helps!
I produced a jar of the latest build a week or so ago. Use that jar available from GitHub.
This sample code will run the rules and apply the appropriate ner tags.
package edu.stanford.nlp.examples;
import edu.stanford.nlp.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import java.util.*;
public class TokensRegexExampleTwo {
public static void main(String[] args) {
// set up properties
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
props.setProperty("tokensregex.rules", "multi-step-per-org.rules");
props.setProperty("tokensregex.caseInsensitive", "true");
// set up pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// set up text to annotate
Annotation annotation = new Annotation("...text to annotate...");
// annotate text
pipeline.annotate(annotation);
// print out found entities
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
System.out.println(token.word() + "\t" + token.ner());
}
}
}
}