Ruby - Array.join versus String Concatenation (Efficiency)

Asked by 面向向阳花 on 2021-02-05 06:56 · tagged: backend · 5 answers · 1795 views

I recall getting a scolding for concatenating strings in Python once upon a time. I was told that it is more efficient to create a list of strings and join them later. Does the same principle apply in Ruby — is Array#join more efficient than concatenating strings?

5 Answers
  • 2021-02-05 07:10

    Yes, it's the same principle. I remember a Project Euler puzzle where I tried it both ways, and calling join was much faster.

    If you check out the Ruby source, join is implemented entirely in C, so it's going to be a lot faster than concatenating strings in Ruby code (no intermediate object creation per step, no extra garbage collection):

    /*
     *  call-seq:
     *     array.join(sep=$,)    -> str
     *  
     *  Returns a string created by converting each element of the array to
     *  a string, separated by <i>sep</i>.
     *     
     *     [ "a", "b", "c" ].join        #=> "abc"
     *     [ "a", "b", "c" ].join("-")   #=> "a-b-c"
     */
    
    static VALUE
    rb_ary_join_m(argc, argv, ary)
        int argc;
        VALUE *argv;
        VALUE ary;
    {
        VALUE sep;
    
        rb_scan_args(argc, argv, "01", &sep);
        if (NIL_P(sep)) sep = rb_output_fs;
    
        return rb_ary_join(ary, sep);
    }
    

    where rb_ary_join is:

    VALUE rb_ary_join(ary, sep)
        VALUE ary, sep;
    {
        long len = 1, i;
        int taint = Qfalse;
        VALUE result, tmp;
    
        if (RARRAY(ary)->len == 0) return rb_str_new(0, 0);
        if (OBJ_TAINTED(ary) || OBJ_TAINTED(sep)) taint = Qtrue;
    
        for (i=0; i<RARRAY(ary)->len; i++) {
            tmp = rb_check_string_type(RARRAY(ary)->ptr[i]);
            len += NIL_P(tmp) ? 10 : RSTRING(tmp)->len;
        }
        if (!NIL_P(sep)) {
            StringValue(sep);
            len += RSTRING(sep)->len * (RARRAY(ary)->len - 1);
        }
        result = rb_str_buf_new(len);
        for (i=0; i<RARRAY(ary)->len; i++) {
            tmp = RARRAY(ary)->ptr[i];
            switch (TYPE(tmp)) {
              case T_STRING:
                break;
              case T_ARRAY:
                if (tmp == ary || rb_inspecting_p(tmp)) {
                    tmp = rb_str_new2("[...]");
                }
                else {
                    VALUE args[2];
    
                    args[0] = tmp;
                    args[1] = sep;
                    tmp = rb_protect_inspect(inspect_join, ary, (VALUE)args);
                }
                break;
              default:
                tmp = rb_obj_as_string(tmp);
            }
            if (i > 0 && !NIL_P(sep))
                rb_str_buf_append(result, sep);
            rb_str_buf_append(result, tmp);
            if (OBJ_TAINTED(tmp)) taint = Qtrue;
        }
    
        if (taint) OBJ_TAINT(result);
        return result;
    }
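
    As an aside (my own addition, not part of the original answer): you can see the object-creation difference between + and << directly in Ruby, since + returns a brand-new String while << mutates its receiver in place:

```ruby
# `+` allocates a new String on every call; `<<` appends to the existing
# object. `equal?` compares object identity, not contents.
a = "foo"
sum = a + "bar"   # fresh String; `a` is untouched
a2  = a << "baz"  # in-place append; returns `a` itself

puts sum.equal?(a)  # false -- different objects
puts a2.equal?(a)   # true  -- same object, mutated
puts a              # "foobaz"
```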
    
  • 2021-02-05 07:17

    Funny, benchmarking gives surprising results (unless I'm doing something wrong):

    require 'benchmark'
    
    N = 1_000_000
    Benchmark.bm(20) do |rep|
    
      rep.report('+') do
        N.times do
          res = 'foo' + 'bar' + 'baz'
        end
      end
    
      rep.report('join') do
        N.times do
          res = ['foo', 'bar', 'baz'].join
        end
      end
    
      rep.report('<<') do
        N.times do
          res = 'foo' << 'bar' << 'baz'
        end
      end
    end
    

    gives

    jablan@poneti:~/dev/rb$ ruby concat.rb 
                              user     system      total        real
    +                     1.760000   0.000000   1.760000 (  1.791334)
    join                  2.410000   0.000000   2.410000 (  2.412974)
    <<                    1.380000   0.000000   1.380000 (  1.376663)
    

    join turns out to be the slowest. The difference might come from creating the array, but that's something you would have to do anyway.
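
    To separate the cost of building the array literal from the cost of join itself, one variation (my sketch, not part of the original answer) is to hoist the array out of the timed loop:

```ruby
require 'benchmark'

N = 100_000
PARTS = ['foo', 'bar', 'baz']  # built once, outside the timed loops

Benchmark.bm(20) do |rep|
  rep.report('+') do
    N.times { 'foo' + 'bar' + 'baz' }
  end

  rep.report('join (hoisted)') do
    N.times { PARTS.join }
  end
end
```

    If join still loses with the array allocation taken out of the loop, the per-call overhead is in join itself (argument scanning, separator handling) rather than in building the array.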

    Oh BTW,

    jablan@poneti:~/dev/rb$ ruby -v
    ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]
    
  • 2021-02-05 07:20

    Try it yourself with the Benchmark class.

    require "benchmark"
    
    n = 1000000
    Benchmark.bmbm do |x|
      x.report("concatenation") do
        foo = ""
        n.times do
          foo << "foobar"
        end
      end
    
      x.report("using lists") do
        foo = []
        n.times do
          foo << "foobar"
        end
        string = foo.join
      end
    end
    

    This produces the following output:

    Rehearsal -------------------------------------------------
    concatenation   0.300000   0.010000   0.310000 (  0.317457)
    using lists     0.380000   0.050000   0.430000 (  0.442691)
    ---------------------------------------- total: 0.740000sec
    
                        user     system      total        real
    concatenation   0.260000   0.010000   0.270000 (  0.309520)
    using lists     0.310000   0.020000   0.330000 (  0.363102)
    

    So it looks like concatenation is a little faster in this case. Benchmark on your system for your use-case.

  • 2021-02-05 07:25

    I was just reading about this. Attached is a link talking about it.

    Building-a-String-from-Parts

    From what I understand, strings in Python and Java are immutable objects, unlike arrays/lists, while in Ruby both strings and arrays are mutable. There might be a minimal difference in speed between building a string with String#concat (or <<) and using Array#join, but it doesn't seem to be a big issue.
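
    A quick way to see that mutability in action (my example, not from the linked article): << grows the same String object rather than allocating a new one, which you can verify with object_id:

```ruby
s = "abc"
id_before = s.object_id

s << "def"                      # mutates `s` in place

puts s                          # "abcdef"
puts s.object_id == id_before   # true -- still the very same object
```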

    I think the link will explain this a lot better than I did.

    Thanks,

    Martin

  • 2021-02-05 07:32

    " The problem is the pile of data as a whole. In his first situation, he had two types of data stockpiling: (1) a temporary string for each row in his CSV file, with fixed quotations and such things, and (2) the giant string containing everything. If each string is 1k and there are 5,000 rows...

    Scenario One: build a big string from little strings

    temporary strings: 5 megs (5,000k)
    massive string: 5 megs (5,000k)
    TOTAL: 10 megs (10,000k)

    Dave's improved script swapped the massive string for an array. He kept the temporary strings, but stored them in an array. The array will only end up costing 5000 * sizeof(VALUE) rather than the full size of each string. And generally, a VALUE is four bytes.

    Scenario Two: storing strings in an array

    strings: 5 megs (5,000k)
    massive array: 20k

    Then, when we need to make a big string, we call join. Now we're up to ten megs and suddenly all those strings become temporary strings and they can all be released at once. It's a huge cost at the end, but it's a lot more efficient than a gradual crescendo that eats resources the whole time. "

    http://viewsourcecode.org/why/hacking/theFullyUpturnedBin.html

    ^ It's actually better for memory/garbage-collection performance to delay the join until the end, just as I was taught to do in Python. The reason being that you get one big chunk of allocation towards the end, followed by an instant release of all the temporary objects.
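
    The allocation pattern described above can be observed with ObjectSpace.count_objects (a rough sketch of mine; exact counts vary across Ruby versions, but += should allocate roughly twice as many Strings as collect-then-join):

```ruby
# Counts how many T_STRING objects a block allocates. GC is disabled so
# the counts aren't skewed by a collection happening mid-measurement.
def string_allocations
  GC.disable
  before = ObjectSpace.count_objects[:T_STRING]
  yield
  after = ObjectSpace.count_objects[:T_STRING]
  after - before
ensure
  GC.enable
end

n = 1_000

grow = string_allocations do
  s = ''
  n.times { s += 'x' }      # every += builds a brand-new String
end

join = string_allocations do
  parts = []
  n.times { parts << 'x' }  # only the small literals are allocated
  parts.join                # one final allocation for the result
end

puts "+=   allocated ~#{grow} strings"
puts "join allocated ~#{join} strings"
```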
