How to train a custom NER model in spaCy with a single-word data set?

故事扮演 submitted on 2021-01-29 13:22:26

Question


I am trying to train a custom NER model in spaCy with a new entity type, 'ANIMAL'. However, my data set consists only of isolated words and short names, with no surrounding sentence:

TRAIN_DATA = [("Whale_ Blue", {"entities": [(0,11,LABEL)]}), ("Shark_ whale", {"entities": [(0,12,LABEL)]}), ("Elephant_ African", {"entities": [(0,17,LABEL)]}), ("Elephant_ Indian", {"entities": [(0,16,LABEL)]}), ("Giraffe_ male", {"entities": [(0,13,LABEL)]}), ("Mule", {"entities": [(0,4,LABEL)]}), ("Camel", {"entities": [(0,5,LABEL)]}), ("Horse", {"entities": [(0,5,LABEL)]}), ("Cow", {"entities": [(0,3,LABEL)]}), ("Dolphin_ Bottlenose", {"entities": [(0,19,LABEL)]}), ("Donkey", {"entities": [(0,6,LABEL)]}), ("Tapir", {"entities": [(0,5,LABEL)]}), ("Shark_ Hammerhead", {"entities": [(0,17,LABEL)]}), ("Seal_ fur", {"entities": [(0,9,LABEL)]}), ("Manatee", {"entities": [(0,7,LABEL)]}), ("Bear_ Grizzly", {"entities": [(0,13,LABEL)]}), ("Alligator_ American", {"entities": [(0,19,LABEL)]}), ("Sturgeon_ Atlantic", {"entities": [(0,18,LABEL)]}), ("Lion", {"entities": [(0,4,LABEL)]}), ("Bear_ American Black", {"entities": [(0,20,LABEL)]}), ("Ostrich", {"entities": [(0,7,LABEL)]}), ("Crocodile_ Saltwater", {"entities": [(0,20,LABEL)]}), ("Pig", {"entities": [(0,3,LABEL)]}), ("Sheep", {"entities": [(0,5,LABEL)]}), ("Dog_ Saint Bernard", {"entities": [(0,18,LABEL)]}), ("Human", {"entities": [(0,5,LABEL)]}), ("Deer_ white-tailed", {"entities": [(0,18,LABEL)]}), ("Tuna", {"entities": [(0,4,LABEL)]}), ("Salamander_ Japanese", {"entities": [(0,20,LABEL)]}), ("Carp", {"entities": [(0,4,LABEL)]}), ("Dog_ Foxhound", {"entities": [(0,13,LABEL)]}), ("Goat_ Milch", {"entities": [(0,11,LABEL)]}), ("Sting Ray", {"entities": [(0,9,LABEL)]}), ("Dog_ Pointer", {"entities": [(0,12,LABEL)]}), ("Kangaroo_ Red", {"entities": [(0,13,LABEL)]}), ("Cod_ Atlantic", {"entities": [(0,13,LABEL)]}), ("Dog_ Collie", {"entities": [(0,11,LABEL)]}), ("Pike_ Northern", {"entities": [(0,14,LABEL)]}), ("Trout_ brown", {"entities": [(0,12,LABEL)]}), ("Dog_ Basset Hound", {"entities": [(0,17,LABEL)]}), ("Turkey", {"entities": [(0,6,LABEL)]}), ("Porcupine", {"entities": [(0,9,LABEL)]}), ("Trout_ Rainbow", 
{"entities": [(0,14,LABEL)]}), ("Gar_ longnose", {"entities": [(0,13,LABEL)]}), ("Beaver", {"entities": [(0,6,LABEL)]}), ("Dog_ Irish Terrier", {"entities": [(0,18,LABEL)]}), ("Dog_ Beagle", {"entities": [(0,11,LABEL)]}), ("Bass_ Large Mouth Black", {"entities": [(0,23,LABEL)]}), ("Dog_ Whippet", {"entities": [(0,12,LABEL)]}), ("Dog_ Boston Terrier", {"entities": [(0,19,LABEL)]}), ("Nutria", {"entities": [(0,6,LABEL)]}), ("Dog_ Fox Terrier", {"entities": [(0,16,LABEL)]}), ("Armadillo_ Nine-banded", {"entities": [(0,22,LABEL)]}), ("Fox_ Arctic", {"entities": [(0,11,LABEL)]}), ("Woodchuck (Groundhog)", {"entities": [(0,21,LABEL)]}), ("Rabbit_ Domestic", {"entities": [(0,16,LABEL)]}), ("Chicken", {"entities": [(0,7,LABEL)]}), ("Dog_ Pekingese", {"entities": [(0,14,LABEL)]}), ("Haddock", {"entities": [(0,7,LABEL)]}), ("Cat_ domestic", {"entities": [(0,13,LABEL)]}), ("Salmon_ Chum", {"entities": [(0,12,LABEL)]}), ("Vulture_ Turkey", {"entities": [(0,15,LABEL)]}), ("Opossum_ Large American", {"entities": [(0,23,LABEL)]}), ("Flounder_ Winter", {"entities": [(0,16,LABEL)]}), ("Pheasant_ Ringnecked", {"entities": [(0,20,LABEL)]}), ("Perch", {"entities": [(0,5,LABEL)]}), ("Duck_ Mallard", {"entities": [(0,13,LABEL)]}), ("Mackerel_ Spanish", {"entities": [(0,17,LABEL)]}), ("Platypus_ Duck-billed", {"entities": [(0,21,LABEL)]}), ("Sea lamprey", {"entities": [(0,11,LABEL)]}), ("Bullhead_ Brown", {"entities": [(0,15,LABEL)]}), ("Mink_ American", {"entities": [(0,14,LABEL)]}), ("Falcon_ Peregrin", {"entities": [(0,16,LABEL)]}), ("Goshawk", {"entities": [(0,7,LABEL)]}), ("Bat_ Flying fox", {"entities": [(0,15,LABEL)]}), ("Duck_ Wood", {"entities": [(0,10,LABEL)]}), ("Buzzard", {"entities": [(0,7,LABEL)]}), ("Bass_ Rock", {"entities": [(0,10,LABEL)]}), ("Squirrel_ Gray", {"entities": [(0,14,LABEL)]}), ("Guinea Pig", {"entities": [(0,10,LABEL)]}), ("Rat_ Norway", {"entities": [(0,11,LABEL)]}), ("Gull_ Herring", {"entities": [(0,13,LABEL)]}), ("Crow_ Hooded", {"entities": 
[(0,12,LABEL)]}), ("Rook", {"entities": [(0,4,LABEL)]}), ("Pumpkinseed", {"entities": [(0,11,LABEL)]}), ("Pigeon", {"entities": [(0,6,LABEL)]}), ("Guinea fowl", {"entities": [(0,11,LABEL)]}), ("Quail_ Bobwhite", {"entities": [(0,15,LABEL)]}), ("Magpie_ Black-billed", {"entities": [(0,20,LABEL)]}), ("European Jackdaw", {"entities": [(0,16,LABEL)]}), ("Hamster", {"entities": [(0,7,LABEL)]}), ("Kestrel_ lesser", {"entities": [(0,15,LABEL)]}), ("Hawk_ Night", {"entities": [(0,11,LABEL)]}), ("Chipmunk_ Eastern", {"entities": [(0,17,LABEL)]}), ("Bat_ little brown", {"entities": [(0,17,LABEL)]}), ("Starling_ Common", {"entities": [(0,16,LABEL)]}), ("Frog_ leopard", {"entities": [(0,13,LABEL)]}), ("Weasel_ least", {"entities": [(0,13,LABEL)]}), ("Mouse_ White-footed", {"entities": [(0,19,LABEL)]}), ("Mouse_ House", {"entities": [(0,12,LABEL)]}), ("Canary", {"entities": [(0,6,LABEL)]}), ("Hummingbird", {"entities": [(0,11,LABEL)]}), ("Hummingbird_ Cuban bee", {"entities": [(0,22,LABEL)]}), ("Shrew_ Musked", {"entities": [(0,13,LABEL)]}), ("Shrew_ dwarf", {"entities": [(0,12,LABEL)]}), ("Goby_ Philippine", {"entities": [(0,16,LABEL)]}), ("Goldfish", {"entities": [(0,8,LABEL)]}), ("Toad_ American", {"entities": [(0,14,LABEL)]}), ("Frog_ Bull", {"entities": [(0,10,LABEL)]}), ("Eel_ American", {"entities": [(0,13,LABEL)]}), ("Penguin_ Adelie", {"entities": [(0,15,LABEL)]}), ("Robin", {"entities": [(0,5,LABEL)]}), ("Kiwi", {"entities": [(0,4,LABEL)]}), ("Fighting Fish_ Siamese", {"entities": [(0,22,LABEL)]}), ("Skate", {"entities": [(0,5,LABEL)]}), ("Quail_ Japanese/European", {"entities": [(0,24,LABEL)]}), ("Gila Monster", {"entities": [(0,12,LABEL)]}), ("Chameleon", {"entities": [(0,9,LABEL)]}), ("Cobra_ Indian", {"entities": [(0,13,LABEL)]}), ("Boa Constrictor", {"entities": [(0,15,LABEL)]}), ("Guppy", {"entities": [(0,5,LABEL)]}), ("Salamander_ Tiger", {"entities": [(0,17,LABEL)]}), ("Swordtail_ Mexican", {"entities": [(0,18,LABEL)]}), ("Stickleback_ three spine", 
{"entities": [(0,24,LABEL)]}), ("Sea horse", {"entities": [(0,9,LABEL)]}), ("Hellbender", {"entities": [(0,10,LABEL)]}), ("Herring_ Atlantic", {"entities": [(0,17,LABEL)]}), ("Chameleon_ Madagascar", {"entities": [(0,21,LABEL)]}), ("Frog_ Cuban", {"entities": [(0,11,LABEL)]}), ]

I used the Python script from https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py

After training, the model gives incorrect results: spaCy also labels other, unrelated words as 'ANIMAL'.

Can anyone explain how to do this correctly? spaCy version: 2.1.8


Answer 1:


Training a spaCy NER model does not rely on the entity text alone. It also extracts other "implicit" features, such as POS and the surrounding words.

When you train on single words, the model cannot extract features that are general enough to detect those entities.
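One way to see the problem, as a rough sketch: in a single-word example the entity span covers the whole text, so the model is never shown a token that is not an entity. That would be consistent with the trained model labelling unrelated words as 'ANIMAL'. The helper below counts the characters that fall outside the annotated spans. `context_chars` is a hypothetical name and not part of spaCy, and the sentence-style example is made up for illustration.

```python
def context_chars(text, entities):
    """Count characters of `text` not covered by any (start, end, label) span."""
    covered = set()
    for start, end, _label in entities:
        covered.update(range(start, end))
    return len(text) - len(covered)


LABEL = "ANIMAL"

# Two entries taken from the question's data set.
single_word = [
    ("Horse", {"entities": [(0, 5, LABEL)]}),
    ("Whale_ Blue", {"entities": [(0, 11, LABEL)]}),
]

# An illustrative sentence (an assumption, not from the question).
in_sentence = [
    ("I saw a horse near the barn", {"entities": [(8, 13, LABEL)]}),
]

for text, annotations in single_word + in_sentence:
    n = context_chars(text, annotations["entities"])
    print(f"{text!r}: {n} characters of context")
```

The single-word entries report zero characters of context, while the sentence leaves most of its text outside the entity for the model to learn from.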

Take, for instance, this example from spaCy's own training tutorial:

# Each entry is (text, [(start_char, end_char, label), ...]).
# end_char is exclusive, so text[start_char:end_char] is the entity text.
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    # "Canada" is text[23:29]
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 29, "GPE")]),
    # "Spotify" is text[0:7]
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]

How could the NER model correctly guess which kind of entity the word "Google" refers to in each of those contexts, if not from its surroundings? The same goes for your words. NER is not a regex-like lookup, but a machine learning model.
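As a minimal sketch of what context-bearing training data could look like, the snippet below places animal names into sentence templates and computes the character offsets automatically. The templates, `make_example`, and `build_train_data` are hypothetical helpers written for illustration. Real sentences from the target domain are likely to generalize better than a handful of templates.

```python
LABEL = "ANIMAL"

# Hypothetical templates; "{}" marks where the animal name goes.
TEMPLATES = [
    "We saw a {} at the zoo yesterday.",
    "The {} is native to this region.",
    "My neighbour keeps a {} on the farm.",
]


def make_example(template, name, label=LABEL):
    """Return (text, annotations) in the same format as TRAIN_DATA,
    with the entity offsets pointing at `name` inside the sentence."""
    start = template.index("{}")
    text = template.replace("{}", name, 1)
    end = start + len(name)  # end offset is exclusive
    return text, {"entities": [(start, end, label)]}


def build_train_data(names, templates=TEMPLATES):
    """Combine every name with every template."""
    return [make_example(t, n) for n in names for t in templates]


TRAIN_DATA = build_train_data(["horse", "camel", "guinea pig"])

# A sentence without any animal gives the model a negative example.
TRAIN_DATA.append(("The meeting starts at nine tomorrow.", {"entities": []}))

for text, annotations in TRAIN_DATA[:3]:
    print(text, annotations)
```

The resulting list has the same shape as the `TRAIN_DATA` in the question, so it could be passed to the same training script. Each example now has surrounding words that are not part of the entity.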



Source: https://stackoverflow.com/questions/57511442/how-to-train-custom-ner-in-spacy-with-single-words-data-set
