Fastest way to check if a string matches a regexp in Ruby?

独厮守ぢ 2020-12-13 05:07

What is the fastest way to check if a string matches a regular expression in Ruby?

My problem is that I have to "egrep" through a huge list of strings to find whic

7 Answers
  • 2020-12-13 05:45

    What about re === str (case compare)?

    Since it evaluates to true or false and has no need for storing matches, returning match index and that stuff, I wonder if it would be an even faster way of matching than =~.


    OK, I tested this. =~ is still faster, even with multiple capture groups. However, === is faster than the other options.

    BTW, what good is freeze? I couldn't measure any performance boost from it.
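
For anyone who wants to try the case-compare idea on a list, here is a minimal sketch. The `egrep` helper and the sample strings are made up for illustration and are not from the question. `Regexp#===` returns a plain boolean, and `Enumerable#grep` uses `===` on each element, so it behaves like an egrep-style filter.

```ruby
# Hypothetical helper: returns the strings that match +re+, egrep-style.
# Enumerable#grep keeps the elements for which `re === element` is true.
def egrep(lines, re)
  lines.grep(re)
end

# Made-up sample data; a real list would be much larger.
lines = ["aacaabc", "xyz", "ab", "ccc"]
re = /a+b/

p(re === "aacaabc")   # => true
p(re === "xyz")       # => false
p egrep(lines, re)    # => ["aacaabc", "ab"]
```

Whether `grep` beats an explicit `select` with `=~` on your data is something to measure rather than assume.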

  • 2020-12-13 05:49

    Starting with Ruby 2.4.0, you may use Regexp#match?:

    pattern.match?(string)
    

    Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, as it avoids object allocations performed by other methods such as Regexp#match and =~:

    Regexp#match?
    Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~ to reduce object allocation.
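
A small sketch of what "without changing $~" means in practice, assuming Ruby 2.4 or later. The two helper method names are invented for this illustration. Each one clears `$~`, runs one kind of check, and returns whatever `$~` holds afterwards.

```ruby
# Runs `pattern =~ string` and returns whatever $~ holds afterwards.
def last_match_after_operator(pattern, string)
  $~ = nil
  pattern =~ string
  $~
end

# Runs `pattern.match?(string)` and returns whatever $~ holds afterwards.
def last_match_after_predicate(pattern, string)
  $~ = nil
  pattern.match?(string)
  $~
end

pattern = /a+b/
p last_match_after_operator(pattern, "aacaabc")   # a MatchData object
p last_match_after_predicate(pattern, "aacaabc")  # nil, $~ was left untouched
p pattern.match?("aacaabc")                       # => true
```

So if later code relies on `$~`, `$1` or `Regexp.last_match`, `match?` is not a drop-in replacement for `=~`.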

  • 2020-12-13 05:57

    This is the benchmark I have run after finding some articles around the net.

    With 2.4.0 the winner is re.match?(str) (as suggested by @wiktor-stribiżew). On previous versions, re =~ str seems to be fastest, although str =~ re is almost as fast.

    #!/usr/bin/env ruby
    require 'benchmark'
    
    str = "aacaabc"
    re = Regexp.new('a+b').freeze
    
    N = 4_000_000
    
    Benchmark.bm do |b|
        b.report("str.match re\t") { N.times { str.match re } }
        b.report("str =~ re\t")    { N.times { str =~ re } }
        b.report("str[re]  \t")    { N.times { str[re] } }
        b.report("re =~ str\t")    { N.times { re =~ str } }
        b.report("re.match str\t") { N.times { re.match str } }
        if re.respond_to?(:match?)
            b.report("re.match? str\t") { N.times { re.match? str } }
        end
    end
    

    Results MRI 1.9.3-p551:

    $ ./bench-re.rb  | sort -t $'\t' -k 2
           user     system      total        real
    re =~ str         2.390000   0.000000   2.390000 (  2.397331)
    str =~ re         2.450000   0.000000   2.450000 (  2.446893)
    str[re]           2.940000   0.010000   2.950000 (  2.941666)
    re.match str      3.620000   0.000000   3.620000 (  3.619922)
    str.match re      4.180000   0.000000   4.180000 (  4.180083)
    

    Results MRI 2.1.5:

    $ ./bench-re.rb  | sort -t $'\t' -k 2
           user     system      total        real
    re =~ str         1.150000   0.000000   1.150000 (  1.144880)
    str =~ re         1.160000   0.000000   1.160000 (  1.150691)
    str[re]           1.330000   0.000000   1.330000 (  1.337064)
    re.match str      2.250000   0.000000   2.250000 (  2.255142)
    str.match re      2.270000   0.000000   2.270000 (  2.270948)
    

    Results MRI 2.3.3 (there is a regression in regex matching, it seems):

    $ ./bench-re.rb  | sort -t $'\t' -k 2
           user     system      total        real
    re =~ str         3.540000   0.000000   3.540000 (  3.535881)
    str =~ re         3.560000   0.000000   3.560000 (  3.560657)
    str[re]           4.300000   0.000000   4.300000 (  4.299403)
    re.match str      5.210000   0.010000   5.220000 (  5.213041)
    str.match re      6.000000   0.000000   6.000000 (  6.000465)
    

    Results MRI 2.4.0:

    $ ./bench-re.rb  | sort -t $'\t' -k 2
           user     system      total        real
    re.match? str     0.690000   0.010000   0.700000 (  0.682934)
    re =~ str         1.040000   0.000000   1.040000 (  1.035863)
    str =~ re         1.040000   0.000000   1.040000 (  1.042963)
    str[re]           1.340000   0.000000   1.340000 (  1.339704)
    re.match str      2.040000   0.000000   2.040000 (  2.046464)
    str.match re      2.180000   0.000000   2.180000 (  2.174691)
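
The timings above fit the release-note claim that `match?` avoids allocations. Here is a rough sketch for checking that directly instead of through wall-clock time. It assumes MRI 2.4 or later, because `GC.stat(:total_allocated_objects)` is MRI-specific. The `allocations` helper is a name I made up, not part of any library.

```ruby
# Hypothetical helper: counts objects allocated while the block runs.
# Relies on an MRI-specific GC statistic; GC is paused to keep the run steady.
def allocations
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

str = "aacaabc"
re  = /a+b/
n   = 10_000

with_match     = allocations { n.times { re.match(str) } }
with_predicate = allocations { n.times { re.match?(str) } }

puts "re.match  allocated #{with_match} objects"
puts "re.match? allocated #{with_predicate} objects"
```

The exact counts depend on the Ruby version, so treat them as a relative comparison only.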
    
  • 2020-12-13 06:02

    What I am wondering is if there is any strange way to make this check even faster, maybe exploiting some strange method in Regexp or some weird construct.

    Regexp engines vary in how they implement searches, but, in general, anchor your patterns for speed, and avoid greedy matches, especially when searching long strings.

    The best thing to do, until you're familiar with how a particular engine works, is to do benchmarks and add/remove anchors, try limiting searches, use wildcards vs. explicit matches, etc.

    The Fruity gem is very useful for quickly benchmarking things, because it's smart. Ruby's built-in Benchmark code is also useful, though if you are not careful you can write tests that fool you.

    I've used both in many answers here on Stack Overflow, so you can search through my answers and will see lots of little tricks and results to give you ideas of how to write faster code.

    The biggest thing to remember is, it's bad to prematurely optimize your code before you know where the slowdowns occur.
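
As a starting point for that kind of experiment, here is a hedged sketch using only the built-in Benchmark module. The haystack, the two patterns and the iteration count are invented for illustration. Note that the two patterns answer different questions: the anchored one only accepts a match at the very start of the string, so it returns false for this input.

```ruby
require 'benchmark'

# Hypothetical input: a long run of filler followed by the interesting token.
HAYSTACK   = ("x" * 10_000) + "CVE-2018-1589"
UNANCHORED = /CVE-[0-9]{4}-[0-9]{4,}/      # may match anywhere in the string
ANCHORED   = /\ACVE-[0-9]{4}-[0-9]{4,}/    # may only match at the start

ITERATIONS = 2_000

Benchmark.bm(12) do |b|
  b.report("unanchored") { ITERATIONS.times { UNANCHORED.match?(HAYSTACK) } }
  b.report("anchored")   { ITERATIONS.times { ANCHORED.match?(HAYSTACK) } }
end
```

An anchored pattern lets the engine give up after the first position, which is why I would expect it to be cheaper on long non-matching input, but the numbers on your Ruby and your data are what count.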

  • 2020-12-13 06:08

    To complete Wiktor Stribiżew's and Dougui's answers, I would say that /regex/.match?("string") is about as fast as "string".match?(/regex/).

    Ruby 2.4.0 (10,000,000 iterations, ~2 sec)

    2.4.0 > require 'benchmark'
     => true 
    2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
     => #<Benchmark::Tms:0x005563da1b1c80 @label="", @real=2.2060338060000504, @cstime=0.0, @cutime=0.0, @stime=0.04000000000000001, @utime=2.17, @total=2.21> 
    2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
     => #<Benchmark::Tms:0x005563da139eb0 @label="", @real=2.260814556000696, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=2.2500000000000004, @total=2.2600000000000007> 
    

    Ruby 2.6.2 (100,000,000 iterations, ~25 sec)

    irb(main):001:0> require 'benchmark'
    => true
    irb(main):005:0> Benchmark.measure{ 100000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
    => #<Benchmark::Tms:0x0000562bc83e3768 @label="", @real=24.60139879199778, @cstime=0.0, @cutime=0.0, @stime=0.010000999999999996, @utime=24.565644999999996, @total=24.575645999999995>
    irb(main):004:0> Benchmark.measure{ 100000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
    => #<Benchmark::Tms:0x0000562bc846aee8 @label="", @real=24.634255946999474, @cstime=0.0, @cutime=0.0, @stime=0.010046, @utime=24.598276, @total=24.608321999999998>
    

    Note: times vary. Sometimes /regex/.match?("string") is faster and sometimes "string".match?(/regex/) is; the differences may be due only to machine activity.

  • 2020-12-13 06:11

    Depending on how complicated your regular expression is, you could possibly just use simple string slicing. I'm not sure about the practicality of this for your application or whether or not it would actually offer any speed improvements.

    'testsentence'['stsen']
    => "stsen" # truthy: the substring was found
    'testsentence'['koala']
    => nil # falsy: the substring was not found
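
If the pattern really is a fixed literal, the plain String predicates are another option to benchmark. This is only a sketch with the same sample string; unlike slicing, these methods return true or false directly.

```ruby
sentence = 'testsentence'

p sentence.include?('stsen')     # => true
p sentence.include?('koala')     # => false
p sentence.start_with?('test')   # => true
p sentence.end_with?('ence')     # => true
```

I have not measured these against `match?`, so whether they are faster for your data is an open question.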
    