Tagging words in sentences using dictionares

£可爱£侵袭症+ 提交于 2020-01-16 16:03:46

问题


I have a corpus of more than 100k sentences and i have dictionary. i want to match the words in the corpus and tagged them in the sentences

corpus file "sentences.txt"

Hello how are you doing. Headache is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
she is doing well
he has psychological problems

Dictionary file "dict.csv"

abc, anxiety, disorder
def, Headache, symptom
hij, Malaria, virus
klm, headache, symptom

My python program

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

import codecs

with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("sentences.txt", "rt")
    my3file = open("tagged_sentences.txt", "w")
    hay = myfile.read()
    myfile.close()

for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            for line in hay.splitlines():
                if max_sim_string in line:
                    tag_sent = line.replace(max_sim_string, str1.__str__())
                    my3file.writelines(tag_sent + '\n')
                    print(tag_sent)
            break

csvFile.close()

my ouput for now is

 he has ('anxiety', ' disorder') thats why he is behaving like that.
 ('Malaria', ' virus') can be cure
 Hello how are you doing. ('Headache', ' symptom') is dangerous

I want my output as. i want it tags the words in the sentences in the same file "sentences.txt" or write it in new file "myfile3.txt. without disturbing the order of sentences or totally ignore (not adding) it

 Hello how are you doing. ('Headache', 'symptom') is dangerous
 ('Malaria', ' virus') can be cure.
 he has ('anxiety', ' disorder') thats why he is behaving like that
 she is doing well
 he has psychological problems

回答1:


Without changing much in your code this should make it work:

...
phrases = []
for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            phrases.append((max_sim_string, row[2]))

for line in hay.splitlines():
    if any(max_sim_string in line for max_sim_string, _ in phrases):
        for phrase in phrases:
            max_sim_string, _ = phrase
            if max_sim_string in line:
                tag_sent = line.replace(max_sim_string, phrase.__str__())
                my3file.writelines(tag_sent + '\n')
                print(tag_sent)
                break        
    else:
        my3file.writelines(line + '\n')

csvFile.close()



回答2:


If you want your output in the order of the sentence input, then you need to build your output with respect to that order. Instead, you designed your program to report results in the order of the dictionary. You need to switch your inner and outer loops.

Read the dict file into an internal data structure, so you don't have to keep resetting and rereading the file.

Then read the sentence file, one line at a time. Look for words to tag (you already do that well). Make the replacements as you're doing, and write out the altered sentence.



来源:https://stackoverflow.com/questions/59568952/tagging-words-in-sentences-using-dictionares

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!