Fast Way to Find Difference between Two Strings of Equal Length in Perl

前端 未结 4 1176
无人及你
无人及你 2021-01-03 01:20

Given pairs of string like this.

    my $s1 = \"ACTGGA\";
    my $s2 = \"AGTG-A\";

   # Note the string can be longer than this.

I would

相关标签:
4条回答
  • 2021-01-03 01:44

    I was bored on Thanksgiving break 2012 and answered the question and more. It will work on strings of equal length. It will work if they are not. I added a help, opt handling just for fun. I thought someone might find it useful. If you are new to PERL add don't know. Don't add any code in your script below DATA to the program. Have fun.

    ./diftxt -h

        usage: diftxt [-v ] string1 string2
                       -v = Verbose 
                      diftxt [-V|--version]
                      diftxt [-h|--help]  "This help!"
    Examples:  diftxt test text
               diftxt "This is a test" "this is real"
    
        Place Holders:  space = "·" , no charater = "ζ"
    

    cat ./diftxt ----------- cut ✂----------

    #!/usr/bin/perl -w
    
    use strict;
    use warnings;
    use Getopt::Std;
    my %options=();
    getopts("Vhv", \%options);
    my $helptxt='
            usage: diftxt [-v ] string1 string2
                           -v = Verbose 
                          diftxt [-V|--version]
                          diftxt [-h|--help]  "This help!"
        Examples:  diftxt test text
                   diftxt "This is a test" "this is real"
    
            Place Holders:  space = "·" , no charater = "ζ"';
    my $Version = "inital-release 1.0 - Quincey Craig 11/21/2012";
    
    print "$helptxt\n\n" if defined $options{h};
    print "$Version\n" if defined $options{V};
    if (@ARGV == 0 ) {
     if (not defined $options{h}) {usage()};
     exit;
    }
    
    my $s1 = "$ARGV[0]";
    my $s2 = "$ARGV[1]";
    my $mask = $s1 ^ $s2;
    
    #  setup unicode output to STDOUT
    binmode DATA, ":utf8";
    my $ustring = <DATA>;
    binmode STDOUT, ":utf8";
    
    my $_DIFF = '';
    my $_CHAR1 = '';
    my $_CHAR2 = '';
    
    sub usage
    {
            print "\n";
            print "usage: diftxt [-v ] string1 string2\n";
            print "               -v = Verbose \n";
            print "       diftxt [-V|--version]\n";
            print "       diftxt [-h|--help]\n\n";
            exit;
    }
    
    sub main
    {
     print "\nOrig\tDiff\tPos\n----\t----\t----\n" if defined $options{v};
     while ($mask =~ /[^\0]/g) {
    ### redirect stderr to allow for test of empty variable with error message from substr   
        open STDERR, '>/dev/null';
        if (substr($s2,$-[0],1) eq "") {$_CHAR2 = "\x{03B6}";close STDERR;} else {$_CHAR2 = substr($s2,$-[0],1)};
        if (substr($s2,$-[0],1) eq " ") {$_CHAR2 = "\x{00B7}"};
          $_CHAR1 = substr($s1,$-[0],1);
        if ($_CHAR1 eq "") {$_CHAR1 = "\x{03B6}"} else {$_CHAR1 = substr($s1,$-[0],1)};
        if ($_CHAR1 eq " ") {$_CHAR1 = "\x{00B7}"};
    ### Print verbose Data  
       print $_CHAR1, "\t", $_CHAR2, "\t", $+[0], "\n" if defined $options{v};
    ### Build difference list 
       $_DIFF = "$_DIFF$_CHAR2";
    ### Build mask 
       substr($s1,"$-[0]",1) = "\x{00B7}";
     } ### end loop
    
     print "\n" if defined $options{v};
     print "$_DIFF, ";
     print "Mask: \"$s1\"\n";
    } ### end main
    if ($#ARGV == 1) {main()};
    __DATA__
    
    0 讨论(0)
  • 2021-01-03 01:45

    Use binary bit ops on the complete strings.

    Things like $s1 & $s2 or $s1 ^ $s2 run incredibly fast, and work with strings of arbitrary length.

    0 讨论(0)
  • 2021-01-03 01:52

    Stringwise ^ is your friend:

    use strict;
    use warnings;
    my $s1 = "ACTGGA";
    my $s2 = "AGTG-A";
    
    my $mask = $s1 ^ $s2;
    while ($mask =~ /[^\0]/g) {
        print substr($s1,$-[0],1), ' ', substr($s2,$-[0],1), ' ', $-[0], "\n";
    }
    

    EXPLANATION:

    The ^ (exclusive or) operator, when used on strings, returns a string composed of the result of an exclusive or on each bit of the numeric value of each character. Breaking down an example into equivalent code:

    "AB" ^ "ab"
    ( "A" ^ "a" ) . ( "B" ^ "b" )
    chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
    chr( 65 ^ 97 ) . chr( 66 ^ 98 )
    chr(32) . chr(32)
    " " . " "
    "  "
    

    The useful feature of this here is that a nul character ("\0") occurs when and only when the two strings have the same character at a given position. So ^ can be used to efficiently compare every character of the two strings in one quick operation, and the result can be searched for non-nul characters (indicating a difference). The search can be repeated using the /g regex flag in scalar context, and the position of each character difference found using $-[0], which gives the offset of the beginning of the last successful match.

    0 讨论(0)
  • 2021-01-03 02:02

    This is the easiest form you can get

    my $s1 = "ACTGGA";
    my $s2 = "AGTG-A";
    
    my @s1 = split //,$s1;
    my @s2 = split //,$s2;
    
    my $i = 0;
    foreach  (@s1) {
        if ($_ ne $s2[$i]) {
            print "$_, $s2[$i] $i\n";
        }
        $i++;
    }
    
    0 讨论(0)
提交回复
热议问题