Getting output in the desired format using TokensRegex

执笔经年 2021-01-24 06:31

I am using TokensRegex for rule-based entity extraction. It works well, but I am having trouble getting my output in the desired format. The following snippet of code gives me an

3 answers
  • 2021-01-24 06:56

    Answering my own question for anyone struggling with a similar issue. The key to getting your output in the correct format lies in how you define your rules in the rules file. Here is what I changed in the rule to change the output:

    Old Rule:

    {    ruleType: "tokens",
         pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
         result: Annotate($1, ner, "LOCATION"),
    
    }
    

    New Rule:

    {    ruleType: "tokens",
         pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
         action: Annotate($1, ner, "LOCATION"),
         result: "LOCATION"
    
    }
    

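    The result field determines the value returned for each match, and therefore the format of your output. In the old rule the Annotate(...) call sat in result, so that call was what came back as the match value. In the new rule the annotation moves to the action field, and result holds the plain string "LOCATION".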

    Hope this helps!

  • 2021-01-24 07:01

    I managed to get the output in the desired format.

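    import java.util.List;
    import java.util.regex.Pattern;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.tokensregex.*;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.util.CoreMap;

    // The statements below go inside a method; `pipeline` is an
    // already-configured StanfordCoreNLP instance.
    Annotation document = new Annotation("...sentence to annotate...");

    // Use the pipeline to annotate the document we created
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

    // Note: I don't put environment-related settings in the rule file;
    // case-insensitive matching is configured here instead.
    Env env = TokenSequencePattern.getNewEnv();
    env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
    env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

    CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor
          .createExtractorFromFiles(env, "test_degree.rules");

    // Run the rules over each sentence of the annotated document
    for (CoreMap sentence : sentences) {
      List<MatchedExpression> matched = extractor.extractExpressions(sentence);
      for (MatchedExpression phrase : matched) {
        // Print out matched text and the value from the rule's result field
        System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
      }
    }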
    

    Output:

    MATCHED ENTITY: Technical Skill VALUE: SKILL
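
If several rules fire, it can help to group the matched text by value rather than print one line per match. The following is a minimal sketch using only the standard library. The `Match` class is a hypothetical stand-in for the pair `phrase.getText()` / `phrase.getValue().get()` from the loop above, and every sample entry other than "Technical Skill" is made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MatchGrouper {
  // Hypothetical stand-in for one match: the matched text and the
  // rule's result value rendered as a string.
  static final class Match {
    final String text;
    final String value;
    Match(String text, String value) { this.text = text; this.value = value; }
  }

  // Group matched text by value, keeping values in first-seen order.
  static Map<String, List<String>> groupByValue(List<Match> matches) {
    Map<String, List<String>> grouped = new LinkedHashMap<>();
    for (Match m : matches) {
      grouped.computeIfAbsent(m.value, k -> new ArrayList<>()).add(m.text);
    }
    return grouped;
  }

  public static void main(String[] args) {
    // Illustrative sample data, not real extractor output.
    List<Match> matches = List.of(
        new Match("Technical Skill", "SKILL"),
        new Match("Main Street", "LOCATION"),
        new Match("Java", "SKILL"));
    System.out.println(groupByValue(matches));
  }
}
```

In real code you would build each `Match` inside the `for (MatchedExpression phrase : matched)` loop instead of hard-coding the list.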

    You might want to have a look at my rule file in this question.

    Hope this helps!

  • 2021-01-24 07:02
    1. I built a jar from the latest source a week or so ago. Use that jar, which is available from GitHub.

    2. This sample code will run the rules and apply the appropriate NER tags.

      package edu.stanford.nlp.examples;
      
      import edu.stanford.nlp.util.*;
      import edu.stanford.nlp.ling.*;
      import edu.stanford.nlp.pipeline.*;
      
      import java.util.*;
      
      
      public class TokensRegexExampleTwo {
      
        public static void main(String[] args) {
      
          // set up properties
          Properties props = new Properties();
          props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
          props.setProperty("tokensregex.rules", "multi-step-per-org.rules");
          props.setProperty("tokensregex.caseInsensitive", "true");
      
          // set up pipeline
          StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
      
          // set up text to annotate
          Annotation annotation = new Annotation("...text to annotate...");
      
          // annotate text
          pipeline.annotate(annotation);
      
          // print out found entities
          for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
              System.out.println(token.word() + "\t" + token.ner());
            }
          }
        }
      }
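
The sample above prints one token and tag per line. If the desired format is one line per entity, adjacent tokens that share a tag can be merged afterwards. This is a minimal sketch using only the standard library. `Tok` is a hypothetical stand-in for the `token.word()` / `token.ner()` pair of a `CoreLabel`, the sketch assumes "O" marks tokens outside any entity, and the sample sentence is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class EntitySpans {
  // Hypothetical stand-in for a CoreLabel: only the word and its NER tag.
  static final class Tok {
    final String word;
    final String ner;
    Tok(String word, String ner) { this.word = word; this.ner = ner; }
  }

  // Merge runs of adjacent tokens with the same non-"O" tag into
  // "text<TAB>TAG" strings. A null tag is treated as "O".
  static List<String> mergeSpans(List<Tok> tokens) {
    List<String> spans = new ArrayList<>();
    StringBuilder text = new StringBuilder();
    String current = "O";
    for (Tok t : tokens) {
      String tag = (t.ner == null) ? "O" : t.ner;
      if (!tag.equals(current)) {
        // The tag changed: flush the span collected so far, if any.
        if (!current.equals("O")) spans.add(text + "\t" + current);
        text.setLength(0);
        current = tag;
      }
      if (!tag.equals("O")) {
        if (text.length() > 0) text.append(' ');
        text.append(t.word);
      }
    }
    if (!current.equals("O")) spans.add(text + "\t" + current);
    return spans;
  }

  public static void main(String[] args) {
    // Illustrative tokens, not real pipeline output.
    List<Tok> tokens = List.of(
        new Tok("He", "O"), new Tok("lives", "O"), new Tok("on", "O"),
        new Tok("Baker", "LOCATION"), new Tok("Street", "LOCATION"),
        new Tok(".", "O"));
    for (String span : mergeSpans(tokens)) System.out.println(span);
  }
}
```

One limitation of this approach: two distinct entities that are directly adjacent and carry the same tag are merged into a single span.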
      