What is an efficient way to check if the current word is close to a word in a string?

Question

Consider the examples below:

Example 1:
```
str1 = "wow...it  looks amazing"
str2 = "looks amazi"
```
Here amazi is close to amazing; str2 is mistyped. I want to write a program that detects that amazi is close to amazing and then replaces amazi with amazing in str2.

Example 2:
```
str1 = "is looking good"
str2 = "looks goo"
```
In this case the updated str2 will be "looking good".

Example 3:
```
str1 = "you are really looking good"
str2 = "lok goo"
```
In this case str2 will be "good", as lok is not close to looking (if the program converts lok to looking instead, that is also fine for my purposes).

Example 4:
```
str1 = "Stu is actually SEVERLY sunburnt....it hurts!!!"
str2 = "hurts!!"
```
The updated str2 will be "hurts!!!".

Example 5:
```
str1 = "you guys were absolutely amazing tonight, a..."
str2 = "ly amazin"
```
The updated str2 will be "amazing"; "ly" should be removed or replaced by "absolutely".

What would the algorithm and code for this look like?

Maybe we can do it by comparing characters in sequence and setting a threshold like 0.8 (80%): if word2 matches 80% of the sequential characters of word1 from str1, we replace word2 in str2 with that word from str1. Is there any other efficient solution, with Python code please?

Answer 1:

There are a lot of ways to approach this. This one solves all of your examples. I added a minimum similarity filter to return only the higher-quality matches. This is what allows the 'ly' to be dropped in the last sample, as it is not all that close to any of the words.

Documentation

You can install the Levenshtein package with: pip install python-Levenshtein

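```
import Levenshtein

def find_match(str1, str2):
    # Minimum Jaro-Winkler similarity for a word to count as a match
    min_similarity = .75
    output = []
    candidates = str1.split()
    # One row of similarity scores per word in str2
    results = [[Levenshtein.jaro_winkler(x, y) for x in candidates] for y in str2.split()]
    for x in results:
        if max(x) >= min_similarity:
            # Keep the str1 word with the highest score
            output.append(candidates[x.index(max(x))])
    return output
```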

Results for each sample you proposed:

```
find_match("is looking good", "looks goo")
['looking','good']

find_match("you are really looking good", "lok goo")
['looking','good']

find_match("Stu is actually SEVERLY sunburnt....it hurts!!!", "hurts!!")
['hurts!!!']

find_match("you guys were absolutely amazing tonight, a...", "ly amazin")
['amazing']
```
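
If installing a third-party package is not an option, a similar idea can be sketched with the standard library's difflib module. This is not the Jaro-Winkler metric used above: difflib.get_close_matches ranks candidates by SequenceMatcher.ratio(), so the scores differ. The function name find_close_words and the cutoff of 0.6 (difflib's default) are assumptions chosen for illustration.

```python
import difflib

def find_close_words(str1, str2, cutoff=0.6):
    """Return, for each word of str2, its closest word in str1 (if any)."""
    candidates = str1.split()
    output = []
    for word in str2.split():
        # get_close_matches returns at most n candidates whose
        # SequenceMatcher ratio is >= cutoff, best match first.
        matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        if matches:
            output.append(matches[0])
    return output

print(find_close_words("is looking good", "looks goo"))
print(find_close_words("you guys were absolutely amazing tonight, a...", "ly amazin"))
```

With this cutoff the two calls print ['looking', 'good'] and ['amazing']. The comparison is case-sensitive, and a stricter cutoff such as 0.75 would reject the looks/looking pair, so the threshold likely needs tuning for real data.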

Answer 2:

Like this:

```
str1 = "wow...it looks amazing"
str2 = "looks amazi"
str3 = []

# Checking for similar strings in both strings:
for n in str1.split():
    for m in str2.split():
        if m in n:
            str3.append(n)

# If found 2 similar strings:
if len(str3) == 2:
    # If their indexes align:
    if str1.split().index(str3[1]) - str1.split().index(str3[0]) == 1:
        print(' '.join(str3))

elif len(str3) == 1:
    print(str3[0])
```

Output:

```
looks amazing
```

UPDATE with condition given by the OP:

```
str1 = "good..."
str2 = "god.."
str3 = []

# Checking for similar strings in both strings:
for n in str1.split():
    for m in str2.split():

        # Collecting the characters of the str2 word that also occur in the str1 word:
        c = ''
        for i in m:
            if i in n:
                c += i
        # If the number of matching characters is at least 50% of the length of the str1 word,
        # or the str2 word is contained in the str1 word:
        if len(c) >= len(n) * 0.50 or m in n:
            str3.append(n)


# If found 2 similar strings:
if len(str3) == 2:
    # If their indexes align:
    if str1.split().index(str3[1]) - str1.split().index(str3[0]) == 1:
        print(' '.join(str3))

elif len(str3) == 1:
    print(str3[0])
```

Answer 3:

I got it working with regular expressions:

```
import re

def check_regex(str1, str2):
    # New list to store the updated values
    str_new = []
    for i in str2:
        # Patterns built from the str2 word: character class, prefix match,
        # suffix match, and captured substring.
        # Note: i is not escaped, so regex metacharacters in it are interpreted.
        x = ['[' + i + ']', '^' + i, i + '$', '(' + i + ')']
        for k in x:
            h = 0
            for j in str1:
                found = "".join(re.findall(k, j))
                # Conditions to make sure the word is close enough to the particular word
                if found == i or (found in i and abs(len(found) - len(i)) == 1 and len(i) != 2):
                    str_new.append(j)
                    h = 1
                    break
            if h == 1:
                break
    return str_new

str1 = input().split()
str2 = input().split()
print(" ".join(check_regex(str1, str2)))
```

Answer 4:

You can use the Jaccard coefficient in this case. First, split both strings on whitespace. Then, for every word in str2, compute the Jaccard coefficient against every word in str1 and replace the str2 word with the str1 word that gives the highest coefficient.

You can use sklearn.metrics.jaccard_score.
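
The steps above can be sketched in plain Python without sklearn. This sketch assumes each word is treated as a set of characters, which is only one of several ways to apply the Jaccard coefficient to strings (character bigrams are another). The names jaccard and closest_words and the 0.5 threshold are hypothetical choices for illustration, not part of any library.

```python
def jaccard(a, b):
    """Jaccard coefficient of the character sets of two words."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # both words empty: define similarity as 0
    return len(set_a & set_b) / len(union)

def closest_words(str1, str2, min_similarity=0.5):
    """For each word in str2, keep the best-scoring word from str1."""
    candidates = str1.split()
    output = []
    if not candidates:
        return output
    for word in str2.split():
        # Pick the str1 word with the highest Jaccard coefficient
        best = max(candidates, key=lambda c: jaccard(word, c))
        # Drop words whose best match is still too dissimilar
        if jaccard(word, best) >= min_similarity:
            output.append(best)
    return output

print(closest_words("wow...it  looks amazing", "looks amazi"))
```

Because character sets ignore order and repetition, anagrams score 1.0 under this measure, so it is cruder than an edit-distance style metric; the threshold filter is what lets a fragment like "ly" be dropped.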

Source: https://stackoverflow.com/questions/62106645/what-is-efficient-way-to-check-if-current-word-is-close-to-a-word-in-string

Tags

python

python-3.x

string

pattern-matching

stop-words