Question
I am trying to train a spaCy model to recognize a few custom named entities. The training data is given below; it mostly concerns server models, dates in the FY (fiscal year) format, and types of HDD:
TRAIN_DATA = [('Send me the number of units shipped in FY21 for A566TY server', {'entities': [(39, 42, 'DateParse'),(48,53,'server')]}),
('Send me the number of units shipped in FY-21 for A5890Y server', {'entities': [(39, 43, 'DateParse'),(49,53,'server')]}),
('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'),(58,61,'server'),(27,29,'HDD')]}),
('Total revenue in FY20Q2 for 3.5 HDD', {'entities': [(17, 22, 'DateParse'),(28,30,'HDD')]}),
('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'),(58,61,'server'),(27,29,'HDD')]}),
('Total units shipped in FY2017-FY2021', {'entities': [(23, 28, 'DateParse'),(30,35,'DateParse')]}),
('Total units shipped in FY 18', {'entities': [(23, 27, 'DateParse')]}),
('Total units shipped between FY16 and FY2021', {'entities': [(28, 31, 'DateParse'),(37,42,'DateParse')]})
]
import random

import spacy
from spacy.util import minibatch, compounding


def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)
    return nlp
But when I run the trained model, even on the training data, no entities are returned:
prdnlp = train_spacy(TRAIN_DATA, 100)
for text, _ in TRAIN_DATA:
    doc = prdnlp(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
The Output is coming as below:
Answer 1:
spaCy can currently only train from entity annotations that line up with token boundaries. The main problem is that your span end offsets are one character too short. The start/end character offsets should behave exactly like string slice indices into the text:
text = "Send me the number of units shipped in FY21 for A566TY server"
# (39, 42, 'DateParse')
assert text[39:42] == "FY2"
You should have (39, 43, 'DateParse') instead.
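One way to avoid this kind of off-by-one error is to compute the offsets from the entity text instead of counting characters by hand. The following is a minimal sketch in plain Python; `find_span` is a hypothetical helper name, not part of spaCy, and it assumes each entity string occurs only once in the sentence (it returns the first match).

```python
def find_span(text, phrase, label):
    """Return (start, end, label) such that text[start:end] == phrase.

    Hypothetical helper, not a spaCy function. Uses the first occurrence
    of `phrase` in `text`.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    end = start + len(phrase)  # exclusive end, exactly like a string slice
    return (start, end, label)


text = "Send me the number of units shipped in FY21 for A566TY server"
entities = [
    find_span(text, "FY21", "DateParse"),
    find_span(text, "A566TY", "server"),
]

# Every span should slice back to the entity text it was built from.
for start, end, label in entities:
    print(label, repr(text[start:end]), (start, end))
```

Building each training example this way keeps the annotation and the text in sync if the sentence is edited later.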
A secondary problem is that you may also need to adjust the tokenizer for cases like FY2017-FY2021, because the default English tokenizer treats this as one token, so the annotations [(23, 28, 'DateParse'),(30,35,'DateParse')] would be ignored during training.
See a more detailed explanation here: https://github.com/explosion/spaCy/issues/4946#issuecomment-580663925
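To see which annotations would be dropped for this reason, you can compare each span against the token boundaries. The sketch below is plain Python and only illustrates the check; `misaligned_spans` is a hypothetical helper, and the token offsets are written out by hand under the assumption stated above (that FY2017-FY2021 is kept as a single token). With spaCy, the offsets could be collected as `[(t.idx, t.idx + len(t)) for t in nlp.make_doc(text)]`.

```python
def misaligned_spans(token_offsets, entities):
    """Return entity spans that do not start and end on token boundaries.

    token_offsets: list of (start, end) character offsets, one per token.
    entities: list of (start, end, label) character spans.
    """
    starts = {start for start, _ in token_offsets}
    ends = {end for _, end in token_offsets}
    return [
        (start, end, label)
        for start, end, label in entities
        if start not in starts or end not in ends
    ]


text = "Total units shipped in FY2017-FY2021"
# Spans with exclusive end offsets, so text[23:29] == "FY2017".
entities = [(23, 29, "DateParse"), (30, 36, "DateParse")]

# Assumed tokenization with "FY2017-FY2021" as one token.
one_token = [(0, 5), (6, 11), (12, 19), (20, 22), (23, 36)]
# Assumed tokenization after the tokenizer is adjusted to split on the hyphen.
split_tokens = [(0, 5), (6, 11), (12, 19), (20, 22), (23, 29), (29, 30), (30, 36)]

bad_before = misaligned_spans(one_token, entities)
bad_after = misaligned_spans(split_tokens, entities)
print("misaligned with one token:", bad_before)
print("misaligned after splitting:", bad_after)
```

spaCy also ships a helper that performs a similar check and marks misaligned tokens with `-`: `biluo_tags_from_offsets` in `spacy.gold` for v2, and `offsets_to_biluo_tags` in `spacy.training` for v3.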
Source: https://stackoverflow.com/questions/60008854/spacy-custom-ner-is-not-returning-any-entity