How do I create my own training corpus for the Stanford tagger?

你的背包 2021-02-05 13:20

I have to analyze informal English text with a lot of shorthand and local lingo, so I was thinking of training my own model for the Stanford tagger.

How do I create my own training corpus?

4 answers
  • 2021-02-05 13:45

    I tried:

    java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] \
       -train trainFilesPath fileRange \
       -saveToSerializedFile serializedGrammarFilename

    But I got this error:

    Error: Could not find or load main class edu.stanford.nlp.parser.lexparser.LexicalizedParser
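
    This error usually means the JVM could not find the class on its classpath, for example because the parser jar was not passed with java -cp. A jar is a zip archive, so one quick check is to look inside it for the class file. The sketch below does that with the Python standard library; the helper name jar_has_class and the jar file name in the usage line are assumptions for illustration, not part of the Stanford distribution.

```python
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the jar at jar_path contains the compiled class_name."""
    # A class a.b.C is stored inside the jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    import sys

    # Usage (jar name is a placeholder): python check_jar.py stanford-parser.jar
    target = "edu.stanford.nlp.parser.lexparser.LexicalizedParser"
    print(jar_has_class(sys.argv[1], target))
```

    If the check succeeds, the same jar path is the one to put on the classpath when running the java command.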

  • 2021-02-05 13:54

    Essentially, the texts that you format for the training process should have one token on each line, followed by a tab, followed by an identifier. The identifier may be something like "LOC" for location, "COR" for corporation, or "0" for non-entity tokens. E.g.

    I     0
    left     0
    my     0
    heart     0
    in     0
    Kansas     LOC
    City     LOC
    .     0
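
    If the annotations already live in a script as (token, label) pairs, a file in the layout above can be generated rather than typed by hand. This is a minimal sketch, not the process our team used; the function name to_tsv is made up for the example.

```python
def to_tsv(tokens_with_labels):
    """Render (token, label) pairs as one 'token<TAB>label' line each."""
    lines = []
    for token, label in tokens_with_labels:
        # A tab inside a token would be read as the column separator.
        if not token.strip() or "\t" in token:
            raise ValueError(f"token is empty or contains a tab: {token!r}")
        lines.append(f"{token}\t{label}\n")
    return "".join(lines)


example = [
    ("I", "0"), ("left", "0"), ("my", "0"), ("heart", "0"),
    ("in", "0"), ("Kansas", "LOC"), ("City", "LOC"), (".", "0"),
]

if __name__ == "__main__":
    print(to_tsv(example), end="")
```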
    

    When our team trained a series of classifiers, we fed each one a training file in this format with roughly 180,000 tokens. We saw a net improvement in precision but a net decrease in recall. (Note that the increase in precision was not statistically significant.) In case it is useful to others, I described the process we used to train the classifier, along with the precision, recall, and F1 values of both the trained and the default classifiers, here.

  • 2021-02-05 13:55

    To train the PoS tagger, see this mailing list post, which is also included in the JavaDocs for the MaxentTagger class.

    The JavaDocs for the edu.stanford.nlp.tagger.maxent.Train class specify the training format:

    The training file should be in the following format: one word and one tag per line separated by a space or a tab. Each sentence should end in an EOS word-tag pair. (Actually, I'm not entirely sure that is still the case, but it probably won't hurt. -wmorgan)
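
    As a rough illustration of that format, the sketch below flattens tagged sentences into one word and one tag per line. The quote does not say what the EOS pair looks like, so it is left as an optional parameter; the function name to_tagger_lines, the sample tokens, and the EOS value in the usage line are all placeholders, not values taken from the Stanford documentation.

```python
def to_tagger_lines(sentences, eos_pair=None, sep="\t"):
    """Flatten tagged sentences into 'word<sep>tag' lines, one token per line."""
    if sep not in (" ", "\t"):
        raise ValueError("separator must be a single space or a tab")
    lines = []
    for sentence in sentences:
        for word, tag in sentence:
            # Whitespace inside a word would be misread as the separator.
            if not word or any(ch.isspace() for ch in word):
                raise ValueError(f"word is empty or contains whitespace: {word!r}")
            lines.append(f"{word}{sep}{tag}\n")
        if eos_pair is not None:
            # Optional end-of-sentence pair; its exact form is an assumption.
            lines.append(f"{eos_pair[0]}{sep}{eos_pair[1]}\n")
    return "".join(lines)


if __name__ == "__main__":
    # Hypothetical informal sentence with Penn-style tags chosen by hand.
    sample = [[("u", "PRP"), ("r", "VBP"), ("gr8", "JJ")]]
    print(to_tagger_lines(sample, eos_pair=("EOS", "EOS")), end="")
```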

  • 2021-02-05 13:59

    For the Stanford Parser, the training data should be in Penn Treebank format; see Stanford's FAQ for the exact commands to use. The JavaDocs for the LexicalizedParser class also give appropriate commands, in particular:

    java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] \
       -train trainFilesPath fileRange \
       -saveToSerializedFile serializedGrammarFilename
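
    Penn Treebank files hold bracketed trees such as (S (NP (PRP I)) (VP (VBD left))), and literal parentheses in the text are written as -LRB- and -RRB-, so every raw bracket is structural. Before training on hand-annotated trees, it can help to check that the brackets balance. The sketch below is one way to do that; split_trees is a made-up helper name, and it checks bracket structure only, not whether the labels are valid.

```python
def split_trees(text):
    """Split bracketed text into top-level trees; raise on unbalanced brackets."""
    trees = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree begins here
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError(f"unmatched ')' at offset {i}")
            depth -= 1
            if depth == 0:
                trees.append(text[start:i + 1])
    if depth != 0:
        raise ValueError("unclosed '(' at end of input")
    return trees


if __name__ == "__main__":
    sample = "(S (NP (PRP I)) (VP (VBD left)))\n(S (NP (NNP Kansas)))"
    for tree in split_trees(sample):
        print(tree)
```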
    