How to use spaCy to do Named Entity Recognition on a CSV file

情话喂你 2021-01-28 15:47

I have tried many ways to run named entity recognition on a column in my CSV file. I tried ne_chunk, but I am unable to get its results into separate columns.

1 answer
  • 2021-01-28 16:06

    It seems that you are checking the chunks incorrectly, which is why you get a KeyError. I'm guessing a little about what you want to do, but this creates a new column for each NER type returned by NLTK. It would be a little cleaner to predefine and zero each NER type column, since as written you get NaN wherever a type does not occur in a row.

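    from collections import Counter

    import nltk
    import pandas as pd

    # Requires the NLTK data for tokenizing, POS tagging and NE chunking (see nltk.download).

    def extract_ner_count(tagged):
        """Count the distinct named entities of each type in a POS-tagged text."""
        entities = {}
        chunks = nltk.ne_chunk(tagged)
        for chunk in chunks:
            # Named entities come back as subtrees; ordinary tokens are (word, tag) tuples.
            if isinstance(chunk, nltk.Tree):
                # If you don't need the entity text, count chunk.label() directly instead.
                t = ' '.join(c[0] for c in chunk.leaves())
                entities[t] = chunk.label()
        return Counter(entities.values())

    news = pd.read_csv("news.csv")
    news['tokenize'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
    news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['tokenize']), axis=1)
    news['entityrecognition'] = news.apply(lambda row: extract_ner_count(row['pos_tags']), axis=1)
    # One column per NER type; rows without that type get NaN.
    news = pd.concat([news, pd.DataFrame(list(news["entityrecognition"]), index=news.index)], axis=1)

    print(news.head())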
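
The suggestion above to predefine and zero each NER type column can be sketched on its own. This is a minimal sketch, not part of the original answer: the helper name `zero_filled` and the list `NE_LABELS` are hypothetical, and the labels are assumed from the `NE_Types` set used later in this answer (without 'O'). Passing the resulting list of dicts to `pd.DataFrame` should give one column per label with 0 instead of NaN.

```python
from collections import Counter

# Assumed label list (hypothetical name); adjust it to the labels your chunker emits.
NE_LABELS = ["FACILITY", "GPE", "GSP", "LOCATION", "ORGANIZATION", "PERSON"]

def zero_filled(counts, labels=NE_LABELS):
    """Return a plain dict with one entry per label, using 0 for absent labels."""
    return {label: counts.get(label, 0) for label in labels}

# Stand-in for the Counters that extract_ner_count returns, one per row.
per_row = [Counter({"PERSON": 2, "GPE": 1}), Counter()]
rows = [zero_filled(c) for c in per_row]
print(rows[0]["PERSON"], rows[1]["PERSON"])  # 2 0
```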
    

    If all you want is the counts, the following is faster and does not produce NaNs:

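    from collections import Counter

    import nltk
    import pandas as pd

    # Load the tagger and chunker once instead of on every call.
    tagger = nltk.PerceptronTagger()
    # Note: _MULTICLASS_NE_CHUNKER is a private NLTK name and may not exist in every NLTK version.
    chunker = nltk.data.load(nltk.chunk._MULTICLASS_NE_CHUNKER)
    NE_Types = {'GPE', 'ORGANIZATION', 'LOCATION', 'GSP', 'O', 'FACILITY', 'PERSON'}

    def extract_ner_count(text):
        """Count named-entity chunks by label in a raw text string."""
        c = Counter()
        chunks = chunker.parse(tagger.tag(nltk.word_tokenize(text, preserve_line=True)))
        for chunk in chunks:
            if isinstance(chunk, nltk.Tree):
                c.update([chunk.label()])
        return c

    news = pd.read_csv("news.csv")
    # Predefine every NER column as 0 so that absent types are 0 rather than NaN.
    for NE_Type in NE_Types:
        news[NE_Type] = 0
    # update() overwrites only the cells for which a count exists.
    news.update(list(news["STORY"].apply(extract_ner_count)))

    print(news.head())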
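
Since the question title asks about spaCy while the answer above uses NLTK, here is a rough spaCy equivalent of the per-row count. This is only a sketch under assumptions: the helper name `count_entity_labels` is hypothetical, and it assumes spaCy plus an English pipeline such as `en_core_web_sm` are installed. Note that spaCy's label names (for example `ORG`) differ from NLTK's.

```python
from collections import Counter

def count_entity_labels(text, nlp):
    """Count entity labels found in `text` by a loaded spaCy pipeline `nlp`."""
    doc = nlp(text)
    # doc.ents holds the recognized entity spans; label_ is the label as a string.
    return Counter(ent.label_ for ent in doc.ents)

# Usage (assumes spaCy and the "en_core_web_sm" model are installed):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# counts = news["STORY"].apply(count_entity_labels, nlp=nlp)
```

The pipeline is passed in as an argument so that it is loaded once and reused for every row, in the same spirit as loading the NLTK tagger and chunker once above.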
    