edit-distance | 易学教程

Efficient way of calculating likeness scores of strings when sample size is large?

Question: Let's say that you have a list of 10,000 email addresses, and you'd like to find the closest "neighbors" in this list, defined as email addresses that are suspiciously close to other email addresses in the list. I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which gives me a score of how many operations are needed to transform one string into another. Let's say that I define "suspiciously close to another email

Quickly compare a string against a Collection in Java

Question: I am trying to calculate the edit distance between a string and every item in a collection to find the closest match. My current problem is that the collection is very large (about 25,000 items), so I narrowed the candidates down to strings of similar length, but that still leaves a few thousand strings and is still very slow. Is there a data structure that allows quick lookup of similar strings, or is there another way I could address this problem?

Answer 1: Sounds like a BK-tree
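To make the BK-tree suggestion concrete, here is a minimal sketch in Python (the question itself is about Java). The class name `BKTree`, its methods, and the helper `levenshtein` are hypothetical names chosen for illustration, not taken from the original answer. The tree stores each word under an edge labeled with its distance to the parent, and the triangle inequality lets a search skip every subtree whose edge label lies outside `[d - max_distance, d + max_distance]`.

```python
def levenshtein(a, b):
    """Two-row dynamic-programming edit distance (assumed metric)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


class BKTree:
    """Hypothetical BK-tree; each node is a (word, {edge_distance: child}) pair."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_distance):
        """Return sorted (distance, word) pairs within max_distance of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_distance:
                results.append((d, word))
            # Triangle inequality: only these subtrees can contain matches.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)
```

A typical use would be to call `add` once per item in the collection and then `search(query, k)` with a small `k`; how much of the tree is pruned depends on the data and on `k`, so the speedup is not guaranteed.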

How do you implement Levenshtein distance in Delphi?

Question: I'm posting this in the spirit of answering your own questions. The question I had was: How can I implement the Levenshtein algorithm for calculating the edit distance between two strings, as described here, in Delphi? Just a note on performance: this thing is very fast. On my desktop (2.33 GHz dual-core, 2 GB RAM, WinXP), I can run through an array of 100K strings in less than one second.

Answer 1:

    function EditDistance(s, t: string): integer;
    var
      d : array of array of integer;
      i, j, cost : integer;

How to determine differences in two lists of data

Question: This is an exercise for the CS guys to shine with the theory. Imagine you have 2 containers with elements. Folders, URLs, files, strings, it really doesn't matter. What is an algorithm to calculate which elements were added and which were removed? Note: If there are many ways to solve this problem, please post one per answer so it can be analysed and voted up. Edit: All the answers solve the matter with 4 containers. Is it possible to use only the initial 2?

Answer 1: Assuming you have two lists of unique items, and
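One common way to compute the added and removed elements is with set differences. The sketch below is in Python and assumes the elements are hashable and that duplicates do not matter; the function name `diff_containers` is hypothetical. It builds temporary sets, so it does not address the "only the initial 2 containers" follow-up.

```python
def diff_containers(old, new):
    """Return (added, removed) between two iterables of hashable items."""
    old_set, new_set = set(old), set(new)
    added = new_set - old_set    # in new but not in old
    removed = old_set - new_set  # in old but not in new
    return added, removed


added, removed = diff_containers(["a", "b", "c"], ["b", "c", "d"])
print(added, removed)
```

With hash-based sets the expected cost is linear in the combined size of the two containers.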

ipython Pandas : How can I compare different rows of one column with Levenshtein distance metric?

Question: I have a table like this:

    id  name
    1   gfh
    2   bob
    3   boby
    4   hgf

etc. I am wondering how I can use the Levenshtein metric to compare different rows of my 'name' column. I already know that I can use this to compare columns:

    L.distance('Hello, Word!', 'Hallo, World!')

But how about rows? Can anybody help?

Answer 1: Here is a way to do it with pandas and numpy:

    from io import StringIO  # pd.core.common.StringIO is not available in current pandas
    import pandas as pd
    from numpy import triu, ones
    t = """id name
    1 gfh
    2 bob
    3 boby
    4 hgf"""
    df = pd.read_csv(StringIO(t), sep=r'\s{1,}').set_index('id')

how do you make a string dictionary function in lua?

Question: Is there a way to replace a string with an entry from a table when the string is close to that entry? Something like a spellcheck function that searches through a table and, if the input is close to an entry in the table, corrects the input so that it matches the entry.

Answer 1: You can use this code :) The reference code is from here: https://github.com/badarsh2/Algorithm-Implementations/blob/master/Levenshtein_distance/Lua/Yonaba/levenshtein.lua

    local function min(a, b, c) return math.min

how to efficiently check if the Levenshtein edit distance between two string is 1 [closed]

Question: Closed 7 years ago. Please note that this does not require actually calculating the Levenshtein edit distance; it is enough to check whether the distance is 1 or not. The signature of the method may look like this: bool Is1EditDistance(string s1, string s2). For example
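A check like this can be done in a single linear pass, without building the full dynamic-programming table. The following is a minimal sketch in Python rather than the language of the question's signature; the name `is_one_edit_distance` is hypothetical. It scans to the first mismatch and then requires the remainders to agree after one substitution (equal lengths) or one insertion (lengths differing by 1).

```python
def is_one_edit_distance(s1: str, s2: str) -> bool:
    """True if exactly one insertion, deletion, or substitution turns s1 into s2."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1  # make s1 the shorter (or equal-length) string
    if len(s2) - len(s1) > 1:
        return False
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            if len(s1) == len(s2):
                # Substitution at i: the rest must match exactly.
                return s1[i + 1:] == s2[i + 1:]
            # Extra character in s2 at i: skip it and compare the rest.
            return s1[i:] == s2[i + 1:]
    # No mismatch in the common prefix: distance is 1 only if s2 has one extra
    # trailing character; identical strings have distance 0.
    return len(s2) - len(s1) == 1
```

Identical strings return `False` here, because their distance is 0 rather than 1.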

use edit distance on arrays in perl

Question: I am attempting to compute the edit distances between the elements of two arrays. I have tried using Text::Levenshtein.

    #!/usr/bin/perl -w
    use strict;
    use Text::Levenshtein qw(distance);
    my @words = qw(four foo bar);
    my @list = qw(foo fear);
    my @distances = distance(@list, @words);
    print "@distances\n";
    #results: 3 2 0 3

However, I want the results to appear as follows:

    2 0 3
    2 3 2

That is, comparing the first element of @list against every element of @words, and then doing the same for the rest of the elements of @list. I

Edit distance algorithm explanation

Question: According to Wikipedia, the definition of the recursive formula which calculates the Levenshtein distance between two strings a and b is the following: I don't understand why we don't take into consideration the cases in which we delete a[j], or we insert b[i]. Also, correct me if I am wrong, but isn't the case of insertion the same as the case of deletion? I mean, instead of deleting a character from one string, we could insert the same character in the second string, and vice versa. So
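The formula that the question refers to did not survive in this excerpt. For orientation, the recursive definition is commonly written as below; this is assumed to be the form the question means, with $i$ and $j$ indexing prefixes of $a$ and $b$, and $1_{(a_i \neq b_j)}$ equal to 1 when the characters differ and 0 otherwise.

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0, \\
  \min
  \begin{cases}
    \operatorname{lev}_{a,b}(i-1,j) + 1 \\
    \operatorname{lev}_{a,b}(i,j-1) + 1 \\
    \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
```

In this form, the first term of the minimum corresponds to deleting a character of $a$, the second to inserting a character of $b$, and the third to a match or substitution.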

Clustering string data with ELKI

Question: I need to cluster a large number of strings using ELKI based on the edit distance (Levenshtein distance). Since the data set is too large, I'd like to avoid file-based precomputed distance matrices. How can I (a) load string data in ELKI from a file (only "Labels")? (b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?) Some code snippets or example input files would be helpful.

Answer 1: It's actually pretty straightforward: A)