Fastest Way To Find Mismatch Positions Between Two Strings of the Same Length

前端 未结 9 1331
后悔当初
后悔当初 2020-12-28 22:08

I have a millions of pairs of string of same length which I want to compare and find the position where it has mismatches.

For example for each $str1 a

相关标签:
9条回答
  • 2020-12-28 22:31

    Some classic string compare optimizations:

    optimal mismatch - start comparing at the END of the search string. e.g. search for ABC in ABDABEABF if you compare at the beginning you will move along the pattern one char at a time. If you search from the end you will be able to jump along three chars

    bad character heuristic - select the least commonly occurring char and search on that first. e.g. in english a 'z' character is rare and good string search functions will take a search for 'maze' and start comparing on the 3rd char

    0 讨论(0)
  • 2020-12-28 22:36

    I was going to say, "write it in C" too.

    Once there you can use optimization like comparing 4 characters at once (as 32-bit integers).

    Or change your representation (4-letter, right?) to use 2-bit to represent a base (?), so that you can compare 16 characters at once.

    0 讨论(0)
  • 2020-12-28 22:38

    I don't know how efficient it is, but you could always xor the two strings you are matching, and find the index of the first mismatch.

    #! /usr/bin/env perl
    use strict;
    use warnings;
    use 5.10.1;
    
    my $str_source = "ATTCCGGG";
    
    my $str1       = "ATTGCGGG";
    my $str2       = "ATACCGGC";
    my $str3       = "GTTCCGGG";
    
    # this returns the index of all of the mismatches (zero based)
    # it returns an empty list if the two strings match.
    sub diff_index{
      my($a,$b) = @_;
      my $cmp = $a^$b;
    
      my @cmp;
      while( $cmp =~ /[^\0]/g ){ # match non-zero byte
        push @cmp, pos($cmp) - 1;
      }
      return @cmp;
    }
    
    for my $str ( $str_source, $str1, $str2, $str3 ){
      say '# "', $str, '"';
      my @ret = diff_index $str_source, $str;
      if( @ret ){
        say '[ ', join( ', ', @ret), ' ]';
      }else{
        say '#   match';
      }
    }
    
    # "ATTCCGGG"
    #   match
    # "ATTGCGGG"
    [ 3 ]
    # "ATACCGGC"
    [ 2, 7 ]
    # "GTTCCGGG"
    [ 0 ]
    

    Running it through B::Concise shows that the CPU expensive operations, happen as single opcodes. Which means that those operations are run in C.

    perl -MO=Concise,-exec,-compact,-src,diff_index test.pl |
    perl -pE's/^[^#].*? \K([^\s]+)$/# $1/' # To fix highlighting bugs
    
    main::diff_index:
    # 15:   my($a,$b) = @_;
    1  <;> nextstate(main 53 test.pl:15) # v:%,*,&,$
    2  <0> pushmark # s
    3  <$> gv(*_) # s
    4  <1> rv2av[t3] # lK/3
    5  <0> pushmark # sRM*/128
    6  <0> padsv[$a:53,58] # lRM*/LVINTRO
    7  <0> padsv[$b:53,58] # lRM*/LVINTRO
    8  <2> aassign[t4] # vKS
    # 16:   my $cmp = $a^$b;
    9  <;> nextstate(main 54 test.pl:16) # v:%,*,&,$
    a  <0> padsv[$a:53,58] # s
    b  <0> padsv[$b:53,58] # s
    c  <2> bit_xor[t6] # sK                     <-----  Single OP -----
    d  <0> padsv[$cmp:54,58] # sRM*/LVINTRO
    e  <2> sassign # vKS/2
    # 18:   my @cmp;
    f  <;> nextstate(main 55 test.pl:18) # v:%,*,&,{,$
    g  <0> padav[@cmp:55,58] # vM/LVINTRO
    # 20:   while( $cmp =~ /[^\0]/g ){ # match non-zero byte
    h  <;> nextstate(main 57 test.pl:20) # v:%,*,&,{,$
    i  <{> enterloop(next->r last->v redo->j) # v
    s  <0> padsv[$cmp:54,58] # s
    t  </> match(/"[^\\0]"/) # sKS/RTIME        <-----  Single OP -----
    u  <|> and(other->j) # vK/1
    # 21:     push @cmp, pos($cmp) - 1;
    j      <;> nextstate(main 56 test.pl:21) # v:%,*,&,$
    k      <0> pushmark # s
    l      <0> padav[@cmp:55,58] # lRM
    m      <0> padsv[$cmp:54,58] # sRM
    n      <1> pos[t8] # sK/1
    o      <$> const(IV 1) # s
    p      <2> subtract[t9] # sK/2
    q      <@> push[t10] # vK/2
    r      <0> unstack # v
               goto # s
    v  <2> leaveloop # vK/2
    # 24:   return @cmp;
    w  <;> nextstate(main 58 test.pl:24) # v:%,*,&,{,$
    x  <0> pushmark # s
    y  <0> padav[@cmp:55,58] 
    z  <@> return # K
    10 <1> leavesub[1 ref] # K/REFC,1
    
    0 讨论(0)
  • 2020-12-28 22:39

    Here is a benchmarking script to figure out if the differences in speed of various approaches. Just keep in mind that there will be a lag the first time a script using Inline::C is invoked as the C compiler is invoked etc. So, run the script once, and then benchmark.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use Benchmark qw( cmpthese );
    
    my ($copies) = @ARGV;
    $copies ||= 1;
    
    my $x = 'ATTCCGGG' x $copies;
    my $y = 'ATTGCGGG' x $copies;
    my $z = 'ATACCGGC' x $copies;
    
    sub wrapper { 
        my ($func, @args) = @_;
        for my $s (@args) {
            my $differences = $func->($x, $s);
            # just trying to ensure results are not discarded
            if ( @$differences == 0 ) { 
                print "There is no difference\n";
            }
        }
        return;
    }
    
    cmpthese -5, {
        explode  => sub { wrapper(\&where_do_they_differ, $y, $z) },
        mism_pos => sub { wrapper(\&mism_pos, $y, $z) },
        inline_c => sub {
            wrapper(\&i_dont_know_how_to_do_stuff_with_inline_c, $y, $z) },
    };
    
    sub where_do_they_differ {
        my ($str1, $str2) = @_;
        my @str1 = split //, $str1;
        my @str2 = split //, $str2;
        [ map {$str1[$_] eq $str2[$_] ? () : $_} 0 .. length($str1) - 1 ];
    }
    
    sub mism_pos {
        my ($str1, $str2) = @_;
        my @mism_pos;
    
        for my $i (0 .. length($str1) - 1) {
            if (substr($str1, $i, 1) ne substr($str2, $i, 1) ) {
                push @mism_pos, $i;
            }
        }
        return \@mism_pos;
    }
    
    sub i_dont_know_how_to_do_stuff_with_inline_c {
        [ find_diffs(@_) ];
    }
    
    use Inline C => << 'EOC';
        void find_diffs(char* x, char* y) {
            int i;
            Inline_Stack_Vars;
            Inline_Stack_Reset;
            for(i=0; x[i] && y[i]; ++i) {
                if(x[i] != y[i]) {
                    Inline_Stack_Push(sv_2mortal(newSViv(i)));
                }
            }
            Inline_Stack_Done;
        }
    EOC
    

    Results (using VC++ 9 on Windows XP with AS Perl 5.10.1) with $copies = 1:

                Rate  explode mism_pos inline_c
    explode  15475/s       --     -64%     -84%
    mism_pos 43196/s     179%       --     -56%
    inline_c 98378/s     536%     128%       --
    

    Results with $copies = 100:

                Rate  explode mism_pos inline_c
    explode    160/s       --     -86%     -99%
    mism_pos  1106/s     593%       --     -90%
    inline_c 10808/s    6667%     877%       --
    
    0 讨论(0)
  • 2020-12-28 22:47

    It sounds like this might be a performance critical part of your application. In this case, you may want to consider writing a C extension method to do the comparison.

    Perl provides the XS extension mechanism which makes this reasonably straightforward.

    0 讨论(0)
  • Inline::C


    The computation is easy, do it with Inline::C (read perldoc Inline::C-Cookbook and perldoc Inline::C for documentation):

    use Inline C => << '...';                                                       
      void find_diffs(char* x, char* y) {                                           
        int i;                                                                      
        Inline_Stack_Vars;                                                          
        Inline_Stack_Reset;                                                         
        for(i=0; x[i] && y[i]; ++i) {                                               
          if(x[i] != y[i]) {                                                        
            Inline_Stack_Push(sv_2mortal(newSViv(i)));                              
          }                                                                         
        }                                                                           
        Inline_Stack_Done;                                                          
      }                                                                             
    ...                                                                             
    
    @diffs= find_diffs("ATTCCGGG","ATTGCGGG");  print "@diffs\n";                   
    @diffs= find_diffs("ATTCCGGG","ATACCGGC");  print "@diffs\n";                   
    

    Here is the output of this script:

    > script.pl 
    3
    2 7
    

    PDL

    If you want to process a lot of data fast in Perl, learn PDL (Documentation):

    use PDL; 
    use PDL::Char;                                                                  
    $PDL::SHARE=$PDL::SHARE; # keep stray warning quiet 
    
    my $source=PDL::Char->new("ATTCCGGG");                                          
    for my $str ( "ATTGCGGG", "ATACCGGC") {                                         
      my $match =PDL::Char->new($str);                                              
      my @diff=which($match!=$source)->list;                                        
      print "@diff\n";                                                              
    }
    

    (Same output as first script.)

    Notes: I used PDL very happily in genomic data processing. Together with memory mapped access to data stored on the disk, huge amounts of data can be processed quickly: all processing is done in highly optimized C loops. Also, you can easily access the same data through Inline::C for any features missing in PDL.

    Note however, that the creation of one PDL vector is quite slow (constant time, it's acceptable for large data structures). So, you rather want to create one large PDL object with all your input data in one go than looping over individual data elements.

    0 讨论(0)
提交回复
热议问题