问题
A common antipattern in Python is to concatenate a sequence of strings using +
in a loop. This is bad because the Python interpreter has to create a new string object for each iteration, and it ends up taking quadratic time. (Recent versions of CPython can apparently optimize this in some cases, but other implementations can\'t, so programmers are discouraged from relying on this.) \'\'.join
is the right way to do this.
However, I\'ve heard it said (including here on Stack Overflow) that you should never, ever use +
for string concatenation, but instead always use \'\'.join
or a format string. I don\'t understand why this is the case if you\'re only concatenating two strings. If my understanding is correct, it shouldn\'t take quadratic time, and I think a + b
is cleaner and more readable than either \'\'.join((a, b))
or \'%s%s\' % (a, b)
.
Is it good practice to use +
to concatenate two strings? Or is there a problem I\'m not aware of?
回答1:
There is nothing wrong in concatenating two strings with +
. Indeed it's easier to read than ''.join([a, b])
.
You are right though that concatenating more than 2 strings with +
is an O(n^2) operation (compared to O(n) for join
) and thus becomes inefficient. However this has not to do with using a loop. Even a + b + c + ...
is O(n^2), the reason being that each concatenation produces a new string.
CPython2.4 and above try to mitigate that, but it's still advisable to use join
when concatenating more than 2 strings.
回答2:
Plus operator is perfectly fine solution to concatenate two Python strings. But if you keep adding more than two strings (n > 25) , you might want to think something else.
''.join([a, b, c])
trick is a performance optimization.
回答3:
The assumption that one should never, ever use + for string concatenation, but instead always use ''.join may be a myth. It is true that using +
creates unnecessary temporary copies of immutable string object but the other not oft quoted fact is that calling join
in a loop would generally add the overhead of function call
. Lets take your example.
Create two lists, one from the linked SO question and another a bigger fabricated
>>> myl1 = ['A','B','C','D','E','F']
>>> myl2=[chr(random.randint(65,90)) for i in range(0,10000)]
Lets create two functions, UseJoin
and UsePlus
to use the respective join
and +
functionality.
>>> def UsePlus():
return [myl[i] + myl[i + 1] for i in range(0,len(myl), 2)]
>>> def UseJoin():
[''.join((myl[i],myl[i + 1])) for i in range(0,len(myl), 2)]
Lets run timeit with the first list
>>> myl=myl1
>>> t1=timeit.Timer("UsePlus()","from __main__ import UsePlus")
>>> t2=timeit.Timer("UseJoin()","from __main__ import UseJoin")
>>> print "%.2f usec/pass" % (1000000 * t1.timeit(number=100000)/100000)
2.48 usec/pass
>>> print "%.2f usec/pass" % (1000000 * t2.timeit(number=100000)/100000)
2.61 usec/pass
>>>
They have almost the same runtime.
Lets use cProfile
>>> myl=myl2
>>> cProfile.run("UsePlus()")
5 function calls in 0.001 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 0.001 0.001 <pyshell#1376>:1(UsePlus)
1 0.000 0.000 0.001 0.001 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {range}
>>> cProfile.run("UseJoin()")
5005 function calls in 0.029 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.015 0.015 0.029 0.029 <pyshell#1388>:1(UseJoin)
1 0.000 0.000 0.029 0.029 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5000 0.014 0.000 0.014 0.000 {method 'join' of 'str' objects}
1 0.000 0.000 0.000 0.000 {range}
And it looks that using Join, results in unnecessary function calls which could add to the overhead.
Now coming back to the question. Should one discourage the use of +
over join
in all cases?
I believe no, things should be taken into consideration
- Length of the String in Question
- No of Concatenation Operation.
And off-course in a development pre-mature optimization is evil.
回答4:
When working with multiple people, it's sometimes difficult to know exactly what's happening. Using a format string instead of concatenation can avoid one particular annoyance that's happened a whole ton of times to us:
Say, a function requires an argument, and you write it expecting to get a string:
In [1]: def foo(zeta):
...: print 'bar: ' + zeta
In [2]: foo('bang')
bar: bang
So, this function may be used pretty often throughout the code. Your coworkers may know exactly what it does, but not necessarily be fully up-to-speed on the internals, and may not know that the function expects a string. And so they may end up with this:
In [3]: foo(23)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/izkata/<ipython console> in <module>()
/home/izkata/<ipython console> in foo(zeta)
TypeError: cannot concatenate 'str' and 'int' objects
There would be no problem if you just used a format string:
In [1]: def foo(zeta):
...: print 'bar: %s' % zeta
...:
...:
In [2]: foo('bang')
bar: bang
In [3]: foo(23)
bar: 23
The same is true for all types of objects that define __str__
, which may be passed in as well:
In [1]: from datetime import date
In [2]: zeta = date(2012, 4, 15)
In [3]: print 'bar: ' + zeta
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/izkata/<ipython console> in <module>()
TypeError: cannot concatenate 'str' and 'datetime.date' objects
In [4]: print 'bar: %s' % zeta
bar: 2012-04-15
So yes: If you can use a format string do it and take advantage of what Python has to offer.
回答5:
I have done a quick test:
import sys
str = e = "a xxxxxxxxxx very xxxxxxxxxx long xxxxxxxxxx string xxxxxxxxxx\n"
for i in range(int(sys.argv[1])):
str = str + e
and timed it:
mslade@mickpc:/binks/micks/ruby/tests$ time python /binks/micks/junk/strings.py 8000000
8000000 times
real 0m2.165s
user 0m1.620s
sys 0m0.540s
mslade@mickpc:/binks/micks/ruby/tests$ time python /binks/micks/junk/strings.py 16000000
16000000 times
real 0m4.360s
user 0m3.480s
sys 0m0.870s
There is apparently an optimisation for the a = a + b
case. It does not exhibit O(n^2) time as one might suspect.
So at least in terms of performance, using +
is fine.
回答6:
According to Python docs, using str.join() will give you performance consistence across various implementations of Python. Although CPython optimizes away the quadratic behavior of s = s + t, other Python implementations may not.
CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.
Sequence Types in Python docs (see the foot note [6])
回答7:
''.join([a, b]) is better solution than +.
Because Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such)
form a += b or a = a + b is fragile even in CPython and isn't present at all in implementations that don't use refcounting (reference counting is a technique of storing the number of references, pointers, or handles to a resource such as an object, block of memory, disk space or other resource)
https://www.python.org/dev/peps/pep-0008/#programming-recommendations
来源:https://stackoverflow.com/questions/10043636/any-reason-not-to-use-to-concatenate-two-strings