I recall getting a scolding for concatenating Strings in Python once upon a time. I was told that it is more efficient to build a list of Strings and join them later.
Yes, it's the same principle. I remember a Project Euler puzzle where I tried it both ways; calling join is much faster.
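For what it's worth, here are the two idioms side by side in Ruby (a minimal sketch of mine, not from the puzzle):

# Concatenating as you go: each += allocates a brand-new String.
result = ""
%w[foo bar baz].each { |part| result += part }

# Collect-then-join: gather the pieces first, allocate the final
# String once at the end.
result = %w[foo bar baz].join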
If you check out the Ruby source, join is implemented entirely in C, so it's going to be a lot faster than concatenating strings (no intermediate object creation, no garbage collection):
/*
 *  call-seq:
 *     array.join(sep=$,)   -> str
 *
 *  Returns a string created by converting each element of the array to
 *  a string, separated by <i>sep</i>.
 *
 *     [ "a", "b", "c" ].join        #=> "abc"
 *     [ "a", "b", "c" ].join("-")   #=> "a-b-c"
 */
static VALUE
rb_ary_join_m(argc, argv, ary)
    int argc;
    VALUE *argv;
    VALUE ary;
{
    VALUE sep;

    rb_scan_args(argc, argv, "01", &sep);
    if (NIL_P(sep)) sep = rb_output_fs;
    return rb_ary_join(ary, sep);
}
where rb_ary_join is:
VALUE
rb_ary_join(ary, sep)
    VALUE ary, sep;
{
    long len = 1, i;
    int taint = Qfalse;
    VALUE result, tmp;

    if (RARRAY(ary)->len == 0) return rb_str_new(0, 0);

    if (OBJ_TAINTED(ary) || OBJ_TAINTED(sep)) taint = Qtrue;
    for (i=0; i<RARRAY(ary)->len; i++) {
        tmp = rb_check_string_type(RARRAY(ary)->ptr[i]);
        len += NIL_P(tmp) ? 10 : RSTRING(tmp)->len;
    }
    if (!NIL_P(sep)) {
        StringValue(sep);
        len += RSTRING(sep)->len * (RARRAY(ary)->len - 1);
    }
    result = rb_str_buf_new(len);
    for (i=0; i<RARRAY(ary)->len; i++) {
        tmp = RARRAY(ary)->ptr[i];
        switch (TYPE(tmp)) {
          case T_STRING:
            break;
          case T_ARRAY:
            if (tmp == ary || rb_inspecting_p(tmp)) {
                tmp = rb_str_new2("[...]");
            }
            else {
                VALUE args[2];
                args[0] = tmp;
                args[1] = sep;
                tmp = rb_protect_inspect(inspect_join, ary, (VALUE)args);
            }
            break;
          default:
            tmp = rb_obj_as_string(tmp);
        }
        if (i > 0 && !NIL_P(sep))
            rb_str_buf_append(result, sep);
        rb_str_buf_append(result, tmp);
        if (OBJ_TAINTED(tmp)) taint = Qtrue;
    }

    if (taint) OBJ_TAINT(result);
    return result;
}
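As a quick illustration of those branches from the Ruby side (my examples, not part of the original source):

%w[a b c].join("-")            #=> "a-b-c"     (T_STRING branch)
[1, :two, 3.0].join("-")       #=> "1-two-3.0" (default branch: each element via rb_obj_as_string)
[%w[a b], %w[c d]].join("-")   #=> "a-b-c-d"   (T_ARRAY branch: nested arrays joined recursively)

a = [1, 2]
a << a                         # a self-referencing array
a.join("-")                    # with the code shown this yields "1-2-[...]";
                               # current Rubies raise ArgumentError here instead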
Funny, benchmarking gives surprising results (unless I'm doing something wrong):
require 'benchmark'

N = 1_000_000

Benchmark.bm(20) do |rep|
  rep.report('+') do
    N.times do
      res = 'foo' + 'bar' + 'baz'
    end
  end
  rep.report('join') do
    N.times do
      res = ['foo', 'bar', 'baz'].join
    end
  end
  rep.report('<<') do
    N.times do
      res = 'foo' << 'bar' << 'baz'
    end
  end
end
gives
jablan@poneti:~/dev/rb$ ruby concat.rb
                           user     system      total        real
+                      1.760000   0.000000   1.760000 (  1.791334)
join                   2.410000   0.000000   2.410000 (  2.412974)
<<                     1.380000   0.000000   1.380000 (  1.376663)
join turns out to be the slowest. It might have to do with creating the array, but that's what you would have to do anyway.
Oh BTW,
jablan@poneti:~/dev/rb$ ruby -v
ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]
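One thing to keep in mind when reading those numbers: << appends into the receiver in place, while + allocates a fresh String for every result. A small illustration (mine, and it assumes string literals aren't frozen):

a = "foo"
b = a
a << "bar"     # mutates the String in place
b              #=> "foobar" (b points at the same object)

c = "foo"
d = c
c = c + "bar"  # allocates a new String and rebinds c
d              #=> "foo" (d still points at the original)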
Try it yourself with the Benchmark class.
require "benchmark"

n = 1000000

Benchmark.bmbm do |x|
  x.report("concatenation") do
    foo = ""
    n.times do
      foo << "foobar"
    end
  end
  x.report("using lists") do
    foo = []
    n.times do
      foo << "foobar"
    end
    string = foo.join
  end
end
This produces the following output:
Rehearsal -------------------------------------------------
concatenation   0.300000   0.010000   0.310000 (  0.317457)
using lists     0.380000   0.050000   0.430000 (  0.442691)
---------------------------------------- total: 0.740000sec

                    user     system      total        real
concatenation   0.260000   0.010000   0.270000 (  0.309520)
using lists     0.310000   0.020000   0.330000 (  0.363102)
So it looks like concatenation is a little faster in this case. Benchmark on your own system for your own use case.
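If you also want to see the worst case, you could add the naive += style to the same harness (a hypothetical extension of the benchmark above; note the much smaller n, because += re-copies the whole accumulated string on every pass and the loop goes quadratic):

require "benchmark"

n = 10_000  # deliberately small: += gets quadratic quickly
Benchmark.bmbm do |x|
  x.report("<<") do
    foo = ""
    n.times { foo << "foobar" }  # appends into the same String
  end
  x.report("+=") do
    foo = ""
    n.times { foo += "foobar" }  # allocates a new String every pass
  end
end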
I was just reading about this. Attached is a link talking about it.
Building-a-String-from-Parts
From what I understand, strings in Python and Java are immutable objects, unlike arrays, while in Ruby both strings and arrays are mutable. There may be a small speed difference between building a string with String#concat or << versus Array#join, but it doesn't seem to be a big issue.
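You can see that mutability difference from Ruby itself (a quick sketch of mine; note that recent Rubies refuse the mutation if the literal is frozen via the frozen_string_literal pragma):

s = "hello"
s << " world"  # fine: Ruby Strings are mutable
s.upcase!      # in-place mutation too
s              #=> "HELLO WORLD"
# A Java String or Python str has no equivalent; every "change" there
# produces a new object.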
I think the link will explain this a lot better than I did.
Thanks,
Martin
" The problem is the pile of data as a whole. In his first situation, he had two types of data stockpiling: (1) a temporary string for each row in his CSV file, with fixed quotations and such things, and (2) the giant string containing everything. If each string is 1k and there are 5,000 rows...
Scenario One: build a big string from little strings

temporary strings: 5 megs (5,000k)
massive string: 5 megs (5,000k)
TOTAL: 10 megs (10,000k)

Dave's improved script swapped the massive string for an array. He kept the temporary strings, but stored them in an array. The array will only end up costing 5000 * sizeof(VALUE) rather than the full size of each string. And generally, a VALUE is four bytes.

Scenario Two: storing strings in an array

strings: 5 megs (5,000k)
massive array: 20k

Then, when we need to make a big string, we call join. Now we're up to ten megs and suddenly all those strings become temporary strings and they can all be released at once. It's a huge cost at the end, but it's a lot more efficient than a gradual crescendo that eats resources the whole time."
http://viewsourcecode.org/why/hacking/theFullyUpturnedBin.html
^It's actually better for memory/garbage-collection performance to delay the join until the end, just like I was taught to do in Python. The reason being that you get one big chunk of allocation towards the end and an instant release of all those objects.
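If you want to watch that allocation pattern yourself, GC.stat gives a rough picture (a hedged sketch of mine; GC.stat(:total_allocated_objects) exists from Ruby 2.1 on, and exact counts will vary by version):

def allocated_during
  GC.start
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

n = 10_000

plus_equals = allocated_during do
  s = ''
  n.times { s += 'x' * 100 }  # new fragment + new accumulator each pass
end

join_style = allocated_during do
  parts = []
  n.times { parts << 'x' * 100 }  # just the fragments pile up
  parts.join                      # one big allocation at the end
end

puts "+=   allocated ~#{plus_equals} objects"
puts "join allocated ~#{join_style} objects"

The += run should report roughly twice as many allocations (each pass builds the fragment and then a whole new accumulator), and the copying it does grows with the string, while join pays its one big cost at the end.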