Since Python's strings are immutable, I was wondering how to concatenate strings more efficiently. I can write it like this:
s += string
In Python >= 3.6, the new f-string is an efficient way to concatenate strings:
>>> name = 'some_name'
>>> number = 123
>>>
>>> f'Name is {name} and the number is {number}.'
'Name is some_name and the number is 123.'
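For plain concatenation of existing strings the same mechanism works; a trivial sketch (the variables are invented for illustration):
>>> s1 = 'Hello'
>>> s2 = 'World'
>>> f'{s1}{s2}'
'HelloWorld'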
If the strings you are concatenating are literals, use String literal concatenation:
import re

re.compile(
    "[A-Za-z_]"      # letter or underscore
    "[A-Za-z0-9_]*"  # letter, digit or underscore
)
This is useful if you want to comment on part of a string (as above), or if you want to use raw strings or triple quotes for part of a literal but not all of it. Since this happens at the syntax level, no concatenation operator is executed at run time.
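For instance, raw and triple-quoted pieces can be mixed with ordinary literals in the same way; a minimal sketch (the pattern and help text are invented for illustration):
import re

# Adjacent literals are glued together at compile time,
# so raw and ordinary strings can be combined freely.
timestamp = re.compile(
    r"\d{4}-\d{2}-\d{2}"  # raw string: ISO date
    "T"                   # ordinary literal
    r"\d{2}:\d{2}"        # raw string: hours and minutes
)

usage = (
    "usage: tool FILE\n"
    """options:
    -h    show this help
"""
)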
As @jdi mentions, the Python documentation suggests using str.join
or io.StringIO
for string concatenation, and says that a developer should expect quadratic time from +=
in a loop, even though there has been an optimisation since Python 2.4. As this answer says:
If Python detects that the left argument has no other references, it calls realloc to attempt to avoid a copy by resizing the string in place. This is not something you should ever rely on, because it's an implementation detail and because if realloc ends up needing to move the string frequently, performance degrades to O(n^2) anyway.
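If you want to see this on your own interpreter, here is a minimal timing sketch (the sizes and names are arbitrary choices, not from the quoted answer; results depend heavily on the interpreter and allocator):
import timeit

def concat(n):
    # relies on the CPython in-place realloc optimisation
    s = ''
    for _ in range(n):
        s += 'x'
    return s

def join(n):
    # the documented idiom: collect pieces, join once
    return ''.join('x' for _ in range(n))

for n in (10_000, 100_000):
    print(n,
          timeit.timeit(lambda: concat(n), number=10),
          timeit.timeit(lambda: join(n), number=10))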
I will show an example of real-world code that naively relied on this +=
optimisation where it didn't apply. The code below converts an iterable of short strings into bigger chunks to be used in a bulk API.
def test_concat_chunk(seq, split_by):
    result = ['']
    for item in seq:
        if len(result[-1]) + len(item) > split_by:
            result.append('')
        result[-1] += item
    return result
This code can literally run for hours because of its quadratic time complexity. Below are alternatives with the suggested data structures:
import io

def test_stringio_chunk(seq, split_by):
    def chunk():
        buf = io.StringIO()
        size = 0
        for item in seq:
            if size + len(item) <= split_by:
                size += buf.write(item)
            else:
                yield buf.getvalue()
                buf = io.StringIO()
                size = buf.write(item)
        if size:
            yield buf.getvalue()

    return list(chunk())
def test_join_chunk(seq, split_by):
    def chunk():
        buf = []
        size = 0
        for item in seq:
            if size + len(item) <= split_by:
                buf.append(item)
                size += len(item)
            else:
                yield ''.join(buf)
                buf.clear()
                buf.append(item)
                size = len(item)
        if size:
            yield ''.join(buf)

    return list(chunk())
And a micro-benchmark:
import timeit
import random
import string
import matplotlib.pyplot as plt

line = ''.join(random.choices(
    string.ascii_uppercase + string.digits, k=512)) + '\n'

x = []
y_concat = []
y_stringio = []
y_join = []
n = 5
for i in range(1, 11):
    x.append(i)
    seq = [line] * (20 * 2 ** 20 // len(line))
    chunk_size = i * 2 ** 20
    y_concat.append(
        timeit.timeit(lambda: test_concat_chunk(seq, chunk_size), number=n) / n)
    y_stringio.append(
        timeit.timeit(lambda: test_stringio_chunk(seq, chunk_size), number=n) / n)
    y_join.append(
        timeit.timeit(lambda: test_join_chunk(seq, chunk_size), number=n) / n)

plt.plot(x, y_concat)
plt.plot(x, y_stringio)
plt.plot(x, y_join)
plt.legend(['concat', 'stringio', 'join'], loc='upper left')
plt.show()
You can also use this (it is claimed to be more efficient here: https://softwareengineering.stackexchange.com/questions/304445/why-is-s-better-than-for-concatenation):
s += "%s" % (stringfromelsewhere)
If you are concatenating a lot of values, then neither. Appending to a list is expensive. You can use StringIO for that, especially if you are building the string up over a lot of operations.
from cStringIO import StringIO
# python3: from io import StringIO
buf = StringIO()
buf.write('foo')
buf.write('foo')
buf.write('foo')
buf.getvalue()
# 'foofoofoo'
If you already have a complete list returned to you from some other operation, then just use ''.join(aList).
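For example (a trivial sketch, not from the original answer):
aList = ['foo', 'bar', 'baz']
''.join(aList)    # 'foobarbaz'
', '.join(aList)  # 'foo, bar, baz'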
From the Python FAQ: What is the most efficient way to concatenate many strings together?
str and bytes objects are immutable, therefore concatenating many strings together is inefficient as each concatenation creates a new object. In the general case, the total runtime cost is quadratic in the total string length.
To accumulate many str objects, the recommended idiom is to place them into a list and call str.join() at the end:
chunks = []
for s in my_strings:
    chunks.append(s)
result = ''.join(chunks)
(another reasonably efficient idiom is to use io.StringIO)
To accumulate many bytes objects, the recommended idiom is to extend a bytearray object using in-place concatenation (the += operator):
result = bytearray()
for b in my_bytes_objects:
    result += b
Edit: I was silly and had the results pasted backwards, making it look like appending to a list was faster than cStringIO. I have also added tests for bytearray/str concat, as well as a second round of tests using a larger list with larger strings. (Python 2.7.3)
ipython test example for large lists of strings
try:
    from cStringIO import StringIO
except ImportError:
    from io import StringIO
source = ['foo']*1000
%%timeit buf = StringIO()
for i in source:
    buf.write(i)
final = buf.getvalue()
# 1000 loops, best of 3: 1.27 ms per loop

%%timeit out = []
for i in source:
    out.append(i)
final = ''.join(out)
# 1000 loops, best of 3: 9.89 ms per loop

%%timeit out = bytearray()
for i in source:
    out += i
# 10000 loops, best of 3: 98.5 µs per loop

%%timeit out = ""
for i in source:
    out += i
# 10000 loops, best of 3: 161 µs per loop
## Repeat the tests with a larger list, containing
## strings that are bigger than the small-string caching
## done by Python
source = ['foo']*1000
# cStringIO
# 10 loops, best of 3: 19.2 ms per loop
# list append and join
# 100 loops, best of 3: 144 ms per loop
# bytearray() +=
# 100 loops, best of 3: 3.8 ms per loop
# str() +=
# 100 loops, best of 3: 5.11 ms per loop
My use case was slightly different. I had to construct a query where more than 20 fields were dynamic. I followed this approach of using the format method:
query = "insert into {0}({1},{2},{3}) values({4}, {5}, {6})"
query = query.format('users', 'name', 'age', 'dna', 'suzan', 1010, 'nda')
This was comparatively simpler for me than using + or other ways.
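If the number of dynamic fields keeps growing, named placeholders can be easier to keep straight than positional ones; a sketch of the same idea (the table and column names are made up, and for real SQL you should prefer your database driver's parameterized queries over string formatting):
query = (
    "insert into {table}({c1},{c2},{c3}) "
    "values({v1}, {v2}, {v3})"
)
query = query.format(table='users', c1='name', c2='age', c3='dna',
                     v1='suzan', v2=1010, v3='nda')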