Question
I am looking for an edit distance algorithm that ignores whitespace and any extra text before or after the match in one of the two strings:
edit("four","foor") = 1
edit("four","noise fo or blur") = 1
Is there an existing algorithm for that? Maybe even a Perl or Python library?
Answer 1:
The code to do this is simple in concept. What you want to ignore is the part you have to add yourself:
#!perl
use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);
use Text::Levenshtein qw(distance);
say edit( "four", "foor" );
say edit( "four", "noise fo or blur" );
sub edit ( $start, $target ) {
    # transform strings to ignore what you want
    # ...
    distance( $start, $target )
}
Maybe you want to check all substrings of the same length:
use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);
use Text::Levenshtein qw(distance);
say edit( "four", "foar" );
say edit( "four", "noise fo or blur" );
sub edit ( $start, $target ) {
    my $start_length = length $start;
    $target =~ s/\s+//g;    # ignore whitespace in the target
    # every substring of the target that is as long as $start
    my @all_n_chars = map {
        substr $target, $_, $start_length
    } 0 .. ( length($target) - $start_length );
    my $closest;
    my $closest_distance = $start_length + 1;
    foreach ( @all_n_chars ) {
        my $distance = distance( $start, $_ );
        if( $distance < $closest_distance ) {
            $closest = $_;
            $closest_distance = $distance;
            say "closest: $closest Distance: $distance";
            last if $distance == 0;
        }
    }
    return $closest_distance;
}
This very simple-minded implementation finds what you want. Be aware, though, that some unrelated substring might happen to have an edit distance that is just as low or lower. It prints:
closest: foar Distance: 1
1
closest: nois Distance: 3
closest: foor Distance: 1
1
You could extend this to remember the true starting positions of each string so you can find it again in the original, but this should be enough to send you on your way. If you wanted to use Python, I think the program might look very similar.
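Since Python came up, here is a rough sketch of how the same sliding-window idea might look there. It is not taken from any library: `levenshtein` is a hand-written dynamic-programming helper standing in for `distance` from Text::Levenshtein, and the `(start, end)` span into the original string is an addition of mine to illustrate remembering the true positions. The names `levenshtein` and `edit` and the shape of the return value are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (hypothetical helper, not a library call)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def edit(start: str, target: str):
    """Return (best distance, (begin, end) span in the original target).

    Assumption: only whitespace is ignored, as in the Perl version.
    """
    if not start:
        return 0, None
    # Keep each non-whitespace character with its index in the original.
    kept = [(ch, i) for i, ch in enumerate(target) if not ch.isspace()]
    squeezed = "".join(ch for ch, _ in kept)
    n = len(start)
    best_distance = n + 1   # same sentinel as the Perl version
    best_span = None
    for pos in range(len(squeezed) - n + 1):
        d = levenshtein(start, squeezed[pos:pos + n])
        if d < best_distance:
            best_distance = d
            # Map the window back to positions in the original string.
            best_span = (kept[pos][1], kept[pos + n - 1][1] + 1)
            if d == 0:
                break
    return best_distance, best_span


if __name__ == "__main__":
    print(edit("four", "foar"))
    print(edit("four", "noise fo or blur"))
```

For `edit("four", "noise fo or blur")` this gives a distance of 1 and the span `(6, 11)`, so slicing the original string with that span recovers `fo or`. As in the Perl version, a target shorter than `$start` simply yields the sentinel distance.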
Answer 2:
Here's a Perl 6 solution. I use a grammar that knows how to grab four interesting characters despite interstitial stuff. More complex requirements require a different grammar, but that's not so hard.
Each time there's a match, the NString::Actions class object gets a chance to inspect the match. It does the same high-water mark thing I was doing before. This looks like a bunch more work, and it is for this trivial example. For more complex examples, it's not going to be that much worse. My Perl 5 version would have to do a lot of tooling to figure out what to keep or not keep.
use Text::Levenshtein;
my $string = 'The quixotic purple and jasmine butterfly flew over the quick zany dog';
grammar NString {
    regex n-chars { [<.ignore-chars>* \w]**4 }
    regex ignore-chars { \s }
}
class NString::Actions {
    # See
    my subset IntInf where Int:D | Inf;
    has $.target;
    has Str $.closest is rw = '';
    has IntInf $.closest-distance is rw = Inf;
    method n-chars ($/) {
        my $string = $/.subst: /\s+/, '', :g;
        my $distance = distance( $string, self.target );
        # say "Matched <$/>. Distance for $string is $distance";
        if $distance < self.closest-distance {
            self.closest = $string;
            self.closest-distance = $distance;
        }
    }
}
my $action = NString::Actions.new: target => 'Perl';
loop {
    state $from = 0;
    my $match = NString.subparse(
        $string,
        :rule('n-chars'),
        :actions($action),
        :c($from)
    );
    last unless ?$match;
    $from++;
}
say "Shortest is { $action.closest } with { $action.closest-distance }";
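The grammar's trick of collecting N word characters while skipping interstitial whitespace can be approximated in Python with a lookahead regex. This is only a sketch under my own assumptions: the pattern, the `closest_window` name, and the hand-written `levenshtein` helper are illustrative and not part of any library, and only whitespace is treated as ignorable, mirroring the `ignore-chars` rule.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (hypothetical helper, not a library call)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def closest_window(target: str, text: str):
    """Return (candidate, distance, start index in text) for the best window.

    A window is len(target) word characters, possibly separated by
    whitespace. The lookahead lets windows overlap, much like retrying
    the grammar rule from each successive position.
    """
    n = len(target)
    if n < 1:
        raise ValueError("target must not be empty")
    # One word char, then (n - 1) more, each optionally preceded by whitespace.
    pattern = re.compile(r"(?=(\w(?:\s*\w){%d}))" % (n - 1))
    best = (None, float("inf"), -1)
    for m in pattern.finditer(text):
        candidate = re.sub(r"\s+", "", m.group(1))
        d = levenshtein(candidate, target)
        if d < best[1]:          # high-water mark, as in the actions class
            best = (candidate, d, m.start(1))
    return best


if __name__ == "__main__":
    print(closest_window("four", "noise fo or blur"))
```

Because the whitespace handling lives in the pattern rather than in a preprocessing step, the match position already refers to the original string, so there is no need to map indices back afterwards.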
(I did a straight port from Perl 5, which I'll leave here)
I tried the same thing in Perl 6, but I'm sure that this is a bit verbose. I was wondering if there's a clever way to grab groups of N chars to compare. Maybe I'll have some improvement later.
use Text::Levenshtein;
put edit( "four", "foar" );
put edit( "four", "noise fo or blur" );
sub edit ( Str:D $start, Str:D $target --> Int:D ) {
    my $target-modified = $target.subst: rx/\s+/, '', :g;
    my $last-position-to-check = [-] map { .chars }, $target-modified, $start;
    my $closest = Any;
    my $closest-distance = $start.chars + 1;
    for 0..$last-position-to-check -> $starting-pos {
        my $substr = $target-modified.substr: $starting-pos, $start.chars;
        my $this-distance = distance( $start, $substr );
        put "So far: $substr -> $this-distance";
        if $this-distance < $closest-distance {
            $closest = $substr;
            $closest-distance = $this-distance;
        }
        # compare with ==; a plain = would assign 0 and never stop early
        last if $this-distance == 0;
    }
    return $closest-distance // -1;
}
Source: https://stackoverflow.com/questions/45254702/edit-distance-ignore-start-end