How do I use IOB tags with Stanford NER?

花落未央 2020-12-08 09:03

There seem to be a few different settings:

iobtags
iobTags
entitySubclassification (IOB1 or IOB2?)
evaluateIOB

Which setting do I use, and

1 answer
  • 2020-12-08 09:13

    How this can be done is currently (2013 releases) a bit of a mess, since there are two different sets of flags for two different DocumentReaderAndWriter implementations. Sorry.

    The most flexible support for different IOB styles is found in CoNLLDocumentReaderAndWriter. While it is reading files, it can map any IOB/IOE/... annotation written with hyphenated prefixes, like your example (B-BRAND), to any other such scheme. You choose the target scheme with the flag:

    -entitySubclassification IOB2
    

    The resulting label set is then used for training and classification. The options are documented in the entitySubclassify() method of CoNLLDocumentReaderAndWriter: IOB1, IOB2, IOE1, IOE2, SBIEO, IO. You can find a discussion of IOB1 vs. IOB2 in Tjong Kim Sang and Veenstra 1999. By default the representation is mapped back to IOB1 on output, since that is the default used in the CoNLL conlleval program, but you can keep the representation you mapped to with the flag:

    -retainEntitySubclassification
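
To make the IOB1 vs. IOB2 difference concrete, here is a minimal Python sketch of the kind of remapping involved. It is an illustration only, not the code in entitySubclassify(); the function name iob1_to_iob2 is made up for this example, and it assumes labels are either "O" or carry a "B-"/"I-" prefix.

```python
def iob1_to_iob2(labels):
    """Convert IOB1 labels to IOB2 (hypothetical helper, not Stanford NER code)."""
    converted = []
    prev_type = None  # entity type of the previous token, or None after "O"
    for label in labels:
        if label == "O":
            converted.append("O")
            prev_type = None
            continue
        prefix, _, entity_type = label.partition("-")
        # In IOB2 every chunk opens with B-. A chunk opens on an explicit B-,
        # or on an I- that does not continue the previous token's entity type.
        starts_chunk = prefix == "B" or entity_type != prev_type
        converted.append(("B-" if starts_chunk else "I-") + entity_type)
        prev_type = entity_type
    return converted


iob1 = ["I-BRAND", "I-BRAND", "O", "I-BRAND", "B-BRAND", "I-LOC"]
print(iob1_to_iob2(iob1))
# ['B-BRAND', 'I-BRAND', 'O', 'B-BRAND', 'B-BRAND', 'B-LOC']
```

In IOB1 the B- prefix appears only where one entity directly follows another of the same type, so the first, fourth and sixth tokens above change prefix while the explicit B-BRAND is kept.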
    

    To use this DocumentReaderAndWriter, you can give a training command like:

    java8 -mx6g edu.stanford.nlp.ie.crf.CRFClassifier -prop conll.crf.chris2009.prop -readerAndWriter edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter -entitySubclassification iob2
    

    Alternatively, ColumnDocumentReaderAndWriter is the default DocumentReaderAndWriter which we use in the distributed models. The options you get with it are different and slightly more limited. You have these two flags:

    • -mergeTags will take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and map them down to a prefix-less IO label ("BRAND") and use that for training and classifying.
    • -iobTags will take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and map them to IOB2.
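
The effect of the two flags on a label sequence can be sketched roughly as follows. This is a Python illustration under my own assumptions, not the ColumnDocumentReaderAndWriter implementation: the helpers merge_tags and iob_tags are hypothetical and merely named after the flags, and I assume that adjacent plain labels of the same type are treated as one entity, since plain labels carry no boundary information.

```python
def split_label(label):
    """Split "I-BRAND" into ("I", "BRAND"); plain labels get an empty prefix."""
    if label[:2] in ("B-", "I-"):
        return label[0], label[2:]
    return "", label


def merge_tags(labels):
    """Drop any prefix, leaving prefix-less IO labels (sketch of -mergeTags)."""
    return [split_label(label)[1] for label in labels]


def iob_tags(labels):
    """Map plain or CoNLL-like labels to IOB2 (sketch of -iobTags)."""
    out = []
    prev_type = "O"
    for label in labels:
        prefix, entity_type = split_label(label)
        if entity_type == "O":
            out.append("O")
        elif prefix == "B" or entity_type != prev_type:
            # Explicit B-, or a change of type, opens a new chunk.
            out.append("B-" + entity_type)
        else:
            out.append("I-" + entity_type)
        prev_type = entity_type
    return out


plain = ["BRAND", "BRAND", "O", "LOC"]
conll = ["I-BRAND", "I-BRAND", "O", "B-LOC"]
print(merge_tags(conll))  # ['BRAND', 'BRAND', 'O', 'LOC']
print(iob_tags(plain))    # ['B-BRAND', 'I-BRAND', 'O', 'B-LOC']
print(iob_tags(conll))    # ['B-BRAND', 'I-BRAND', 'O', 'B-LOC']
```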

    In a sequence model, for any of the labeling schemes like IOB2, the labels are different classes. That is how these labeling schemes work. The special interpretation of "I-", "B-", etc. is left to the human observer and entity-level evaluation software. The included evaluation software will work with IOB1, IOB2, or prefixless IO encoding only.
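
As a rough illustration of that last point, the sketch below shows the kind of interpretation that entity-level evaluation software has to add on top of the per-token classes. It is Python written for this answer, not the evaluation code shipped with Stanford NER; the function name iob2_spans is hypothetical and the input is assumed to be well-formed IOB2.

```python
def iob2_spans(labels):
    """Return (entity_type, start, end) spans, end exclusive, from IOB2 labels."""
    spans = []
    current = None  # entity type of the open chunk, if any
    start = None    # index where the open chunk began
    for i, label in enumerate(labels):
        continues = label.startswith("I-") and label[2:] == current
        if continues:
            continue
        # Anything else closes the open chunk (if there is one).
        if current is not None:
            spans.append((current, start, i))
        if label == "O":
            current, start = None, None
        else:
            current, start = label[2:], i
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


labels = ["B-BRAND", "I-BRAND", "O", "B-LOC"]
print(sorted(set(labels)))  # the classifier sees four unrelated classes
print(iob2_spans(labels))   # [('BRAND', 0, 2), ('LOC', 3, 4)]
```

To the sequence model, B-BRAND and I-BRAND are simply two classes; only a step like iob2_spans treats them as parts of one BRAND entity.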
