Tagging words in sentences using dictionaries

Question

I have a corpus of more than 100k sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.

Corpus file "sentences.txt"

Hello how are you doing. Headache is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
she is doing well
he has psychological problems

Dictionary file "dict.csv"

abc, anxiety, disorder
def, Headache, symptom
hij, Malaria, virus
klm, headache, symptom

My Python program

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

import codecs

with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("sentences.txt", "rt")
    my3file = open("tagged_sentences.txt", "w")
    hay = myfile.read()
    myfile.close()

for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            for line in hay.splitlines():
                if max_sim_string in line:
                    tag_sent = line.replace(max_sim_string, str1.__str__())
                    my3file.writelines(tag_sent + '\n')
                    print(tag_sent)
            break

csvFile.close()

My output for now is

 he has ('anxiety', ' disorder') thats why he is behaving like that.
 ('Malaria', ' virus') can be cure
 Hello how are you doing. ('Headache', ' symptom') is dangerous

I want the output shown below. The words should be tagged either in the same file "sentences.txt" or in a new file "myfile3.txt", without changing the order of the sentences and without dropping the sentences that contain no match.

 Hello how are you doing. ('Headache', 'symptom') is dangerous
 ('Malaria', ' virus') can be cure.
 he has ('anxiety', ' disorder') thats why he is behaving like that
 she is doing well
 he has psychological problems

Answer 1:

Without changing much in your code, this should make it work:

...
phrases = []
for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            # Remember the matched text and its tag instead of writing right away.
            phrases.append((max_sim_string, row[2]))

# Walk the sentences in their original order so the output keeps that order.
for line in hay.splitlines():
    for phrase in phrases:
        max_sim_string, _ = phrase
        if max_sim_string in line:
            tag_sent = line.replace(max_sim_string, str(phrase))
            my3file.write(tag_sent + '\n')
            print(tag_sent)
            break
    else:
        # No phrase matched this sentence: copy it unchanged.
        my3file.write(line + '\n')

my3file.close()
csvFile.close()

Answer 2:

If you want your output in the order of the sentence input, then you need to build your output with respect to that order. Instead, you designed your program to report results in the order of the dictionary. You need to switch your inner and outer loops.

Read the dict file into an internal data structure, so you don't have to keep resetting and rereading the file.

Then read the sentence file, one line at a time. Look for words to tag (you already do that well). Make the replacements as you're doing, and write out the altered sentence.
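
The steps above could be sketched roughly as follows. This is only a sketch under a few assumptions, not the asker's program. The helper names `load_terms` and `tag_line` are made up for illustration, the n-gram window is exactly as long as the dictionary term, and `skipinitialspace=True` is used so the tags do not carry the leading space seen in the question's output. The 0.9 threshold and `SequenceMatcher` come from the question. In-memory streams stand in for the two files so the example runs on its own.

```python
import csv
import io
from difflib import SequenceMatcher


def load_terms(csv_file):
    """Read the dictionary once into a list of (term, tag) pairs.

    Rows are assumed to look like: id, term, tag.
    """
    reader = csv.reader(csv_file, skipinitialspace=True)
    return [(row[1], row[2]) for row in reader if len(row) >= 3 and row[1].strip()]


def tag_line(line, terms, threshold=0.9):
    """Return the line with every sufficiently similar n-gram replaced by its tag tuple.

    Note: the line is split on whitespace and re-joined with single spaces.
    """
    words = line.split()
    out = []
    i = 0
    while i < len(words):
        best = None  # (similarity, number of words consumed, tag)
        for term, tag in terms:
            n = len(term.split())
            if n == 0:
                continue
            candidate = " ".join(words[i:i + n])
            score = SequenceMatcher(None, candidate, term).ratio()
            if score > threshold and (best is None or score > best[0]):
                best = (score, n, tag)
        if best is None:
            # Nothing matched at this position: keep the word as it is.
            out.append(words[i])
            i += 1
        else:
            _, n, tag = best
            out.append(str((" ".join(words[i:i + n]), tag)))
            i += n
    return " ".join(out)


# Stand-ins for dict.csv and sentences.txt from the question.
dict_csv = io.StringIO(
    "abc, anxiety, disorder\n"
    "def, Headache, symptom\n"
    "hij, Malaria, virus\n"
    "klm, headache, symptom\n"
)
sentences = io.StringIO(
    "Hello how are you doing. Headache is dangerous\n"
    "Malaria can be cure\n"
    "she is doing well\n"
)

terms = load_terms(dict_csv)
# Outer loop over sentences, inner loop over dictionary terms:
# every sentence is written exactly once, in input order.
for sentence in sentences:
    print(tag_line(sentence.rstrip("\n"), terms))
```

To work on real files, the two `io.StringIO` objects can be replaced with `open("dict.csv", newline="")` and `open("sentences.txt")`, and the `print` call with a write to the output file. Every word position is compared against every dictionary term, so this may be slow for 100k sentences and a large dictionary; an exact lookup in a `dict` before falling back to fuzzy matching would likely help.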

Source: https://stackoverflow.com/questions/59568952/tagging-words-in-sentences-using-dictionares

Tags

python

dictionary

tagging

ner