strstr faster than algorithms?

后端 未结 4 676
一向
一向 2020-12-24 03:42

I have a file that\'s 21056 bytes.

I\'ve written a program in C that reads the entire file into a buffer, and then uses multiple search algorithms to search the file

相关标签:
4条回答
  • 2020-12-24 03:49

    Without seeing your code, it's hard to say exactly. strstr is heavily optimized, and usually written in assembly language. It does things like reading data 4 bytes at a time and comparing them (bit-twiddling if necessary if the alignment isn't right) to minimize memory latency. It can also take advantage of things like SSE to load 16 bytes at a time. If your code is only loading one byte at a time, it's probably getting killed by memory latency.

    Use your debugger and step through the disassembly of strstr -- you'll probably find some interesting things in there.

    0 讨论(0)
  • 2020-12-24 03:53

    Why do you think strstr should be slower than all the others? Do you know what algorithm strstr uses? I think it's quite likely that strstr uses a fine-tuned, processor-specific, assembly-coded algorithm of the KMP type or better. In which case you don't stand a chance of out-performing it in C for such small benchmarks.

    (The reason I think this is likely is that programmers love to implement such things.)

    0 讨论(0)
  • 2020-12-24 03:56

    Imagine you want to get something cleaned. You could just clean it yourself, or you could hire ten professional cleaners to clean it. If the cleaning job is an office building, the latter solution would be preferable. If the cleaning job was one window, the former would be preferable.

    You never get any payback for the time spent setting up to do the job efficiently because the job doesn't take very long.

    0 讨论(0)
  • 2020-12-24 04:08

    Horspool, KMP et al are optimal at minimizing the number of byte-comparisons.

    However, that's not the bottleneck on a modern processor. On an x86/64 processor, your string is being loaded into L1 cache in cache-line-width chunks (typically 64 bytes). No matter how clever your algorithm is, unless it gives you strides that are larger than that, you gain nothing; and the more complicated Horspool code is (at least one table lookup) can't compete.

    Furthermore, you're stuck with the "C" string constraint of null-termination: SOMEWHERE the code has to examine every byte.

    strstr() is expected to be optimal for a wide range of cases; e.g. searching for tiny strings like "\r\n" in a short string, as well as much longer ones where some smarter algorithm might have a hope. The basic strchr/memcmp loop is pretty hard to beat across the whole range of likely inputs.

    Pretty much all x86-compatible processors since 2003 have supported SSE2. If you disassembled strlen()/x86 for glibc, you may have noticed that it uses some SSE2 PCMPEQ and MOVMASK operations to search for the null terminator 16 bytes at a time. The solution is so efficient that it beats the obvious super-simple loop, for anything longer than the empty string.

    I took that idea and came up with a strstr() that beats glibc's strstr() for all cases greater than 1 byte --- where the relative difference is pretty much moot. If you're interested, check out:

    • Convergence SSE2 and strstr()

    • A better strstr() with no ASM code

      If you care to see a non-SSE2 solution that dominates strstr() for target strings of more than 15 bytes, check out:

      which makes use of multibyte comparisons rather than strchr(), to find a point at which to do a memcmp.

    BTW you've probably figured out by now that the x86 REP SCASB/REP CMPSB ops fall on their ass for anything longer than 32 bytes, and are not much improvement for shorter strings. Wish Intel had devoted a little more attention to that, than to adding SSE4.2 "string" ops.

    For strings large enough to matter, my perf tests show that BNDM is better than Horspool, across the board. BNDM is more tolerant of "pathological" cases, such as targets that heavily repeat the last byte of a pattern. BNDM can also make use of SSE2 (128bit registers) in a way that competes with 32bit registers in efficiency and start-up cost. Source code here.

    0 讨论(0)
提交回复
热议问题