Question
I have a corpus of more than 100k sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.
Corpus file "sentences.txt"
Hello how are you doing. Headache is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
she is doing well
he has psychological problems
Dictionary file "dict.csv"
abc, anxiety, disorder
def, Headache, symptom
hij, Malaria, virus
klm, headache, symptom
My Python program
import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("sentences.txt", "rt")
    my3file = open("tagged_sentences.txt", "w")
    hay = myfile.read()
    myfile.close()

    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9
        max_sim_string = u""
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)
            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                str1 = max_sim_string , row[2]
                for line in hay.splitlines():
                    if max_sim_string in line:
                        tag_sent = line.replace(max_sim_string, str1.__str__())
                        my3file.writelines(tag_sent + '\n')
                        print(tag_sent)
                        break

csvFile.close()
My output for now is
he has ('anxiety', ' disorder') thats why he is behaving like that.
('Malaria', ' virus') can be cure
Hello how are you doing. ('Headache', ' symptom') is dangerous
I want my output to look like the following. The words should be tagged in the sentences, either in the same file "sentences.txt" or in a new file "myfile3.txt", without disturbing the order of the sentences and without leaving out the sentences that have no match.
Hello how are you doing. ('Headache', 'symptom') is dangerous
('Malaria', ' virus') can be cure.
he has ('anxiety', ' disorder') thats why he is behaving like that
she is doing well
he has psychological problems
Answer 1:
Without changing much in your code this should make it work:
...
    phrases = []
    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9
        max_sim_string = u""
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)
            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                str1 = max_sim_string , row[2]
                phrases.append((max_sim_string, row[2]))

    for line in hay.splitlines():
        if any(max_sim_string in line for max_sim_string, _ in phrases):
            for phrase in phrases:
                max_sim_string, _ = phrase
                if max_sim_string in line:
                    tag_sent = line.replace(max_sim_string, phrase.__str__())
                    my3file.writelines(tag_sent + '\n')
                    print(tag_sent)
                    break
        else:
            my3file.writelines(line + '\n')

csvFile.close()
Answer 2:
If you want your output in the order of the sentence input, then you need to build your output with respect to that order. Instead, you designed your program to report results in the order of the dictionary. You need to switch your inner and outer loops.
Read the dict file into an internal data structure, so you don't have to keep resetting and rereading the file.
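As a rough illustration of that first step, the dictionary could be loaded once into a list of (term, label) pairs. This is only a sketch: the function name `load_dictionary` is made up here, and it assumes every row has the three columns shown in the question (id, term, label).

```python
import csv
import io


def load_dictionary(csv_file):
    """Read (term, label) pairs from an open CSV file with rows: id, term, label."""
    entries = []
    # skipinitialspace drops the blank that follows each comma in the sample file
    for row in csv.reader(csv_file, skipinitialspace=True):
        if len(row) >= 3:  # skip blank or malformed rows
            entries.append((row[1].strip(), row[2].strip()))
    return entries


# In-memory stand-in for the dictionary file from the question
sample = io.StringIO("abc, anxiety, disorder\ndef, Headache, symptom\n")
print(load_dictionary(sample))
```

Stripping the whitespace here also avoids labels such as ' disorder' with a leading blank, which is what shows up in the output posted in the question.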
Then read the sentence file, one line at a time. Look for words to tag (you already do that well). Make the replacements as you're doing, and write out the altered sentence.
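The swapped loops might look roughly like the sketch below. It assumes the dictionary is already in memory as a list of (term, label) pairs; `tag_line`, `tag_lines` and the `threshold` parameter are illustrative names, not part of the original program. To stay within the standard library it builds the n-grams with list slices instead of `nltk.util.ngrams`, and it only compares against the words of the current line.

```python
import re
from difflib import SequenceMatcher


def tag_line(line, entries, threshold=0.9):
    """Tag fuzzy matches of dictionary terms in one sentence.

    `entries` is a list of (term, label) pairs; `threshold` mirrors the
    0.9 cut-off used in the question.
    """
    words = line.split()
    found = {}  # matched text in the line -> label
    for term, label in entries:
        k = len(term.split())
        n = k + int(0.2 * k)  # same n-gram size as in the question
        if n == 0:
            continue  # ignore empty dictionary terms
        best_ratio, best_text = threshold, None
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            ratio = SequenceMatcher(None, candidate, term).ratio()
            if ratio > best_ratio:
                best_ratio, best_text = ratio, candidate
        if best_text is not None:
            found.setdefault(best_text, label)  # first dictionary entry wins
    if not found:
        return line  # nothing to tag: keep the sentence unchanged
    # Replace all matches in a single pass so inserted tags are never re-tagged.
    longest_first = sorted(found, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(text) for text in longest_first))
    return pattern.sub(lambda m: str((m.group(0), found[m.group(0)])), line)


def tag_lines(lines, entries, threshold=0.9):
    """Yield every input line, tagged or not, in its original order."""
    for line in lines:
        yield tag_line(line.rstrip("\n"), entries, threshold)


entries = [("anxiety", "disorder"), ("Headache", "symptom"), ("Malaria", "virus")]
sentences = ["Malaria can be cure", "she is doing well"]
for tagged in tag_lines(sentences, entries):
    print(tagged)
```

With real files, the same generator can be fed an open "sentences.txt" and each yielded line written to the output file followed by '\n'. Because every line is yielded whether or not it matched, the order of the sentences is preserved and unmatched sentences are not dropped.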
Source: https://stackoverflow.com/questions/59568952/tagging-words-in-sentences-using-dictionares