Fastest Way To Find Mismatch Positions Between Two Strings of the Same Length

后悔当初 2020-12-28 22:08

I have millions of pairs of strings of the same length, and for each pair I want to find the positions where the two strings mismatch.

For example for each $str1 a

9 Answers
  • 2020-12-28 22:53

    You're making two calls to substr for each character comparison, which is probably what's slowing you down.

    A few optimizations I would make:

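    my @source = split //, $str_source;  # split once rather than calling substr per character
    my @base   = split //, $str_base;

    my %mism_pos;
    for my $i (0 .. $#source) {          # $#source is the last valid index
        $mism_pos{$i} = 1 if $source[$i] ne $base[$i];  # hashing is faster than array push
    }

    return keys %mism_pos;               # mismatch positions, in no particular order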
    
  • 2020-12-28 22:54

    The fastest way to find the differences would be to XOR the two strings byte by byte and test each result for zero; any nonzero byte marks a mismatch. If I had to do this, I would write a standalone C program for the comparison rather than a C extension to Perl, and run that program as a subprocess of Perl. The exact algorithm would depend on the length of the strings and the amount of data, but it should not take more than 100 lines of C. If you want to maximize speed, a program that XORs the bytes of fixed-length strings and tests for zero could even be written in assembly language.
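
The same XOR idea can also be tried without leaving Perl, since the `^` operator works bytewise when both operands are strings. The following is only a minimal sketch of that variant, not the C program this answer has in mind; the helper name `mismatch_positions_xor` and the sample sequences are made up for illustration, and it assumes single-byte characters and that `use feature 'bitwise'` is not in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 0-based positions at which two
# equal-length byte strings differ.
sub mismatch_positions_xor {
    my ($s1, $s2) = @_;
    my $mask = $s1 ^ $s2;          # string XOR: "\0" wherever the bytes match
    my @pos;
    while ($mask =~ /[^\0]/g) {    # every non-NUL byte marks a mismatch
        push @pos, $-[0];          # $-[0] is the offset where this match began
    }
    return @pos;
}

# Illustrative input only; the two strings differ at index 3 (C vs G).
my @pos = mismatch_positions_xor('ATTCCGGG', 'ATTGCGGG');
print "@pos\n";                    # prints "3"
```

Whether this beats a split-based loop or an external C program depends on string length and data volume, so it is worth benchmarking on real data.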

  • 2020-12-28 22:55

    Those look like gene sequences. If the strings are all 8 characters long and the only possible codes are A, C, G, and T, you might consider transforming the data before processing it. That would give you only 65536 (4^8) possible strings, so you can specialise your implementation.

    For example, you could write a method that takes an 8-character string and maps it to an integer. Memoize it so that the operation is quick. Next, write a comparison function that, given two integers, tells you how they differ. You would call it in a suitable looping construct, guarded by a numeric equality test such as unless ( $a == $b ), so that identical codes short-circuit and skip the comparison.
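
One way this could look is sketched below. The 2-bit encoding (A=0, C=1, G=2, T=3), the helper names `encode` and `diff_positions`, and the sample sequences are all assumptions for illustration; the answer does not prescribe a particular mapping.

```perl
use strict;
use warnings;

my %CODE = (A => 0, C => 1, G => 2, T => 3);  # assumed 2-bit code per base
my %cache;                                    # memo: sequence string -> integer

# Hypothetical helper: pack an A/C/G/T string into an integer,
# two bits per base, leftmost base in the highest bits.
sub encode {
    my ($seq) = @_;
    return $cache{$seq} //= do {
        my $n = 0;
        $n = ($n << 2) | $CODE{$_} for split //, $seq;
        $n;
    };
}

# Hypothetical helper: given two encoded integers and the sequence length,
# list the 0-based positions (leftmost base = 0) whose codes differ.
sub diff_positions {
    my ($x, $y, $len) = @_;
    my $xor = $x ^ $y;                        # numeric XOR of the two codes
    my @pos;
    for my $i (0 .. $len - 1) {
        my $shift = 2 * ($len - 1 - $i);      # bit offset of base $i
        push @pos, $i if ($xor >> $shift) & 3;
    }
    return @pos;
}

# Illustrative input only; identical codes skip the comparison entirely.
my ($p, $q) = (encode('ATTCCGGG'), encode('ATTGCGGG'));
my @pos = $p == $q ? () : diff_positions($p, $q, 8);
print "@pos\n";                               # prints "3"
```

Because there are at most 65536 distinct 8-character codes, the memo table stays small even across millions of pairs.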
