Question
I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.
Let's say I have a document containing the text:
{"text": "The quick brown fox jumps over the lazy dog"}
I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".
In other words, I want to add noise to strings to generate misspelled words (typos).
What would be a way of automatically generating words with typos for evaluating fuzzy search?
Answer 1:
I would write a small program that randomly alters letters in your words. You can adapt it to the specific requirements of your case, but the general idea is as follows.
Say you have a phrase
phrase = "The quick brown fox jumps over the lazy dog"
Then define a probability for a word to change (say 10%)
p = 0.1
Then loop over the words of the phrase and draw a sample from a uniform distribution on [0, 1) for each one. If the sample is at or below your threshold, replace one randomly chosen letter of the word:
import random
import string

new_phrase = []
words = phrase.split(' ')
for word in words:
    outcome = random.random()
    # skip empty strings, which split(' ') produces for repeated spaces
    if word and outcome <= p:
        # pick one position and replace it with a random ASCII letter
        ix = random.randrange(len(word))
        new_word = word[:ix] + random.choice(string.ascii_letters) + word[ix + 1:]
        new_phrase.append(new_word)
    else:
        new_phrase.append(word)
new_phrase = ' '.join(new_phrase)
In my case, I got the following result:
print(new_phrase)
The quick brown fWx jumps ovey the lazy dog
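The loop above only substitutes letters. As a minimal sketch, assuming you also want insertions, deletions, and transpositions, the same idea could be wrapped in two functions. The names `add_typo` and `add_noise` are hypothetical, and the seeded `random.Random` is an addition here so that a generated test set can be reproduced.

```python
import random
import string


def add_typo(word, rng):
    """Apply one random edit (substitute, insert, delete, or transpose)."""
    if not word:
        return word
    ops = ["substitute", "insert"]
    if len(word) > 1:
        # deleting or transposing needs at least two characters
        ops += ["delete", "transpose"]
    op = rng.choice(ops)
    i = rng.randrange(len(word))
    letter = rng.choice(string.ascii_lowercase)
    if op == "substitute":
        # may pick the same letter again, leaving the word unchanged
        return word[:i] + letter + word[i + 1:]
    if op == "insert":
        return word[:i] + letter + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    # transpose: swap the character at j with its right neighbour
    j = min(i, len(word) - 2)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def add_noise(phrase, p=0.1, seed=None):
    """Corrupt each word of phrase with probability p."""
    rng = random.Random(seed)
    return " ".join(
        add_typo(w, rng) if rng.random() <= p else w
        for w in phrase.split()
    )
```

Calling `add_noise("The quick brown fox jumps over the lazy dog", p=0.3, seed=42)` repeatedly returns the same noisy phrase, which makes recall numbers comparable between runs.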
Answer 2:
I haven't used this myself, but a quick Google search found https://www.dcs.bbk.ac.uk/~ROGER/corpora.html, which I guess you can use to get frequent misspellings for the words in your text.
You can also generate misspellings yourself using keyboard distance, as explained in "Edit distance such as Levenshtein taking into account proximity on keyboard".
There may be other databases or corpora of frequent misspellings besides the one above. I would guess that just randomly inserting, deleting, or changing characters up to a total Levenshtein distance of, say, 3 will not give a useful evaluation of your system, because people don't make mistakes at random; their spelling mistakes follow simple, logical patterns.
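A keyboard-proximity generator along these lines could look roughly like the following. It is only a sketch: the QWERTY layout, the staggered-row adjacency rule, and the names `build_neighbours` and `keyboard_typo` are assumptions made for illustration, not taken from the linked material.

```python
import random

# Letter rows of a QWERTY keyboard (assumed layout; digits and
# punctuation are ignored in this sketch).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbours(rows=ROWS):
    """Map each letter to the letters physically next to it."""
    neighbours = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            near = set()
            # left and right on the same row
            for cc in (c - 1, c + 1):
                if 0 <= cc < len(row):
                    near.add(row[cc])
            # rows are staggered: the row above touches columns c and c+1
            if r > 0:
                for cc in (c, c + 1):
                    if 0 <= cc < len(rows[r - 1]):
                        near.add(rows[r - 1][cc])
            # ... and the row below touches columns c-1 and c
            if r + 1 < len(rows):
                for cc in (c - 1, c):
                    if 0 <= cc < len(rows[r + 1]):
                        near.add(rows[r + 1][cc])
            neighbours[key] = sorted(near)
    return neighbours


NEIGHBOURS = build_neighbours()


def keyboard_typo(word, rng=random):
    """Replace one letter of word with an adjacent key, keeping its case."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBOURS]
    if not positions:
        return word  # nothing to corrupt (e.g. digits only)
    i = rng.choice(positions)
    repl = rng.choice(NEIGHBOURS[word[i].lower()])
    if word[i].isupper():
        repl = repl.upper()
    return word[:i] + repl + word[i + 1:]
```

For example, `keyboard_typo("fox")` might return `dox` or `fpx`, depending on the random draw. Because a key is never its own neighbour, the result always differs from the input when the word contains at least one letter.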
Source: https://stackoverflow.com/questions/51079986/generate-misspelled-words-typos