问题
I'd like to shorten a string using textwrap.shorten or a function like it. The string can potentially have non-ASCII characters. What's special here is that the maximal width
is for the bytes
encoding of the string. This problem is motivated by the fact that several database column definitions and some message buses have a bytes
based max length.
For example:
>>> import textwrap
>>> s = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
# Available function that I tried:
>>> textwrap.shorten(s, width=27)
'☺ Ilsa, le méchant ☺ [...]'
>>> len(_.encode())
31 # I want ⩽27
# Desired function:
>>> shorten_to_bytes_width(s, width=27)
'☺ Ilsa, le méchant [...]'
>>> len(_.encode())
27 # I want and get ⩽27
It's okay for the implementation to use a width greater than or equal to the length of the whitespace-stripped placeholder [...]
, i.e. 5.
The text should not be shortened any more than necessary. Some buggy implementations can use optimizations which on occasion result in excessive shortening.
Using textwrap.wrap with bytes count is a similar question but it's different enough from this one since it is about textwrap.wrap
, not textwrap.shorten
. Only the latter function uses a placeholder
([...]
) which makes this question sufficiently unique.
Caution: Do not rely on any of the answers here for shortening a JSON encoded string in a fixed number of bytes. For it, substitute text.encode()
with json.dumps(text)
.
回答1:
In theory it's enough to encode
your string, then check if it fits in the "width" constraint. If it does, then the string can be simply returned. Otherwise you can take the first "width" bytes from the encoded string (minus the bytes needed for the placeholder). To make sure it works like textwrap.shorten
one also needs to find the last whitespace in the remaining bytes and return everything before the whitespace + the placeholder. If there is no whitespace only the placeholder needs to be returned.
Given that you mentioned that you really want it byte-amount constrained the function throws an exception if the placeholder is too large. Because having a placeholder that wouldn't fit into the byte-constrained container/datastructure simply doesn't make sense and avoids a lot of edge cases that could result in inconsistent "maximum byte size" and "placeholder byte size".
The code could look like this:
def shorten_rsplit(string: str, maximum_bytes: int, normalize_spaces: bool = False, placeholder: str = "[...]") -> str:
# Make sure the placeholder satisfies the byte length requirement
encoded_placeholder = placeholder.encode().strip()
if maximum_bytes < len(encoded_placeholder):
raise ValueError('placeholder too large for max width')
# Get the UTF-8 bytes that represent the string and (optionally) normalize the spaces.
if normalize_spaces:
string = " ".join(string.split())
encoded_string = string.encode()
# If the input string is empty simply return an empty string.
if not encoded_string:
return ''
# In case we don't need to shorten anything simply return
if len(encoded_string) <= maximum_bytes:
return string
# We need to shorten the string, so we need to add the placeholder
substring = encoded_string[:maximum_bytes - len(encoded_placeholder)]
splitted = substring.rsplit(b' ', 1) # Split at last space-character
if len(splitted) == 2:
return b" ".join([splitted[0], encoded_placeholder]).decode()
else:
return '[...]'
And a simple test case:
t = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
for i in range(5, 50):
shortened = shorten_rsplit(t, i)
byte_length = len(shortened.encode())
print(byte_length <= i, i, byte_length, shortened)
Which returns
True 5 5 [...]
True 6 5 [...]
True 7 5 [...]
True 8 5 [...]
True 9 9 ☺ [...]
True 10 9 ☺ [...]
True 11 9 ☺ [...]
True 12 9 ☺ [...]
True 13 9 ☺ [...]
True 14 9 ☺ [...]
True 15 15 ☺ Ilsa, [...]
True 16 15 ☺ Ilsa, [...]
True 17 15 ☺ Ilsa, [...]
True 18 18 ☺ Ilsa, le [...]
True 19 18 ☺ Ilsa, le [...]
True 20 18 ☺ Ilsa, le [...]
True 21 18 ☺ Ilsa, le [...]
True 22 18 ☺ Ilsa, le [...]
True 23 18 ☺ Ilsa, le [...]
True 24 18 ☺ Ilsa, le [...]
True 25 18 ☺ Ilsa, le [...]
True 26 18 ☺ Ilsa, le [...]
True 27 27 ☺ Ilsa, le méchant [...]
True 28 27 ☺ Ilsa, le méchant [...]
True 29 27 ☺ Ilsa, le méchant [...]
True 30 27 ☺ Ilsa, le méchant [...]
True 31 31 ☺ Ilsa, le méchant ☺ [...]
True 32 31 ☺ Ilsa, le méchant ☺ [...]
True 33 31 ☺ Ilsa, le méchant ☺ [...]
True 34 31 ☺ Ilsa, le méchant ☺ [...]
True 35 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 36 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 37 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 38 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 39 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 40 35 ☺ Ilsa, le méchant ☺ ☺ [...]
True 41 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 42 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 43 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 44 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 45 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 46 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 47 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 48 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
True 49 41 ☺ Ilsa, le méchant ☺ ☺ gardien ☺
The function also has an argument for normalizing the spaces. That could be helpful in case you have different kind of whitespaces (newlines, etc.) or multiple sequential spaces. Although it will be a bit slower.
Performance
I did a quick test using simple_benchmark (a library I wrote) to ensure it's actually faster.
For the benchmark I create a string containing random unicode characters where the (on average) one out of 8 characters is a whitespace. I also use half the length of the string as byte-width to split. Both have no special reason, it could bias the benchmarks though, that's why I wanted to mention it.
The functions used in the benchmark:
def shorten_rsplit(string: str, maximum_bytes: int, normalize_spaces: bool = False, placeholder: str = "[...]") -> str:
encoded_placeholder = placeholder.encode().strip()
if maximum_bytes < len(encoded_placeholder):
raise ValueError('placeholder too large for max width')
if normalize_spaces:
string = " ".join(string.split())
encoded_string = string.encode()
if not encoded_string:
return ''
if len(encoded_string) <= maximum_bytes:
return string
substring = encoded_string[:maximum_bytes - len(encoded_placeholder)]
splitted = substring.rsplit(b' ', 1) # Split at last space-character
if len(splitted) == 2:
return b" ".join([splitted[0], encoded_placeholder]).decode()
else:
return '[...]'
import textwrap
_MIN_WIDTH = 5
def shorten_to_bytes_width(text: str, width: int) -> str:
width = max(_MIN_WIDTH, width)
text = textwrap.shorten(text, width)
while len(text.encode()) > width:
text = textwrap.shorten(text, len(text) - 1)
assert len(text.encode()) <= width
return text
def naive(text: str, width: int) -> str:
width = max(_MIN_WIDTH, width)
text = textwrap.shorten(text, width)
if len(text.encode()) <= width:
return text
current_width = _MIN_WIDTH
index = 0
slice_index = 0
endings = ' '
while True:
new_width = current_width + len(text[index].encode())
if new_width > width:
break
if text[index] in endings:
slice_index = index
index += 1
current_width = new_width
if slice_index:
slice_index += 1 # to include found space
text = text[:slice_index] + '[...]'
assert len(text.encode()) <= width
return text
MAX_BYTES_PER_CHAR = 4
def bytes_to_char_length(input, bytes, start=0, max_length=None):
if bytes <= 0 or (max_length is not None and max_length <= 0):
return 0
if max_length is None:
max_length = min(bytes, len(input) - start)
bytes_too_much = len(input[start:start + max_length].encode()) - bytes
if bytes_too_much <= 0:
return max_length
min_length = max(max_length - bytes_too_much, bytes // MAX_BYTES_PER_CHAR)
max_length -= (bytes_too_much + MAX_BYTES_PER_CHAR - 1) // MAX_BYTES_PER_CHAR
new_start = start + min_length
bytes_left = bytes - len(input[start:new_start].encode())
return min_length + bytes_to_char_length(input, bytes_left, new_start, max_length - min_length)
def shorten_to_bytes(input, bytes, placeholder=' [...]', start=0):
if len(input[start:start + bytes + 1].encode()) <= bytes:
return input
bytes -= len(placeholder.encode())
max_chars = bytes_to_char_length(input, bytes, start)
if max_chars <= 0:
return placeholder.strip() if bytes >= 0 else ''
w = input.rfind(' ', start, start + max_chars + 1)
if w > 0:
return input[start:w] + placeholder
else:
return input[start:start + max_chars] + placeholder
# Benchmark
from simple_benchmark import benchmark, MultiArgument
import random
def get_random_unicode(length): # https://stackoverflow.com/a/21666621/5393381
get_char = chr
include_ranges = [
(0x0021, 0x0021), (0x0023, 0x0026), (0x0028, 0x007E), (0x00A1, 0x00AC), (0x00AE, 0x00FF),
(0x0100, 0x017F), (0x0180, 0x024F), (0x2C60, 0x2C7F), (0x16A0, 0x16F0), (0x0370, 0x0377),
(0x037A, 0x037E), (0x0384, 0x038A), (0x038C, 0x038C)
]
alphabet = [
get_char(code_point) for current_range in include_ranges
for code_point in range(current_range[0], current_range[1] + 1)
]
# Add more whitespaces
for _ in range(len(alphabet) // 8):
alphabet.append(' ')
return ''.join(random.choice(alphabet) for i in range(length))
r = benchmark(
[shorten_rsplit, shorten_to_bytes, shorten_to_bytes_width, naive, bytes_to_char_length],
{2**exponent: MultiArgument([get_random_unicode(2**exponent), 2**exponent // 2]) for exponent in range(4, 15)},
"string length"
)
I also did a second benchmark excluding the shorten_to_bytes_width
function so I could benchmark even longer strings:
r = benchmark(
[shorten_rsplit, shorten_to_bytes, naive],
{2**exponent: MultiArgument([get_random_unicode(2**exponent), 2**exponent // 2]) for exponent in range(4, 20)},
"string length"
)
回答2:
This solution is inefficient, but it does appear to always work correctly and without ever shortening excessively. It serves as a canonical baseline for testing any efficient solutions.
It first shortens pretending that the text is an ASCII string; this can shorten insufficiently but never excessively. It then inefficiently shortens one character at a time, and no more than necessary.
import textwrap
_MIN_WIDTH = 5 # == len(textwrap.shorten(string.ascii_letters, len(string.ascii_letters) - 1)) == len('[...]')
def shorten_to_bytes_width(text: str, width: int) -> str:
# Ref: https://stackoverflow.com/a/56401167/
width = max(_MIN_WIDTH, width) # This prevents ValueError if width < _MIN_WIDTH
text = textwrap.shorten(text, width) # After this line, len(text.encode()) >= width
while len(text.encode()) > width:
text = textwrap.shorten(text, len(text) - 1)
assert len(text.encode()) <= width
return text
Credit: Thanks to Sanyash for an improvement.
Test
>>> s = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
>>> shorten_to_bytes_width(s, 27)
'☺ Ilsa, le méchant [...]'
>>> len(_.encode())
27
Testing a candidate answer
Any candidate answer can be tested by comparing its outputs with the outputs of my function for width
of range(50, -1, -1)
or at minimum range(50, 5, -1)
. Given a candidate
function, the code below implements the unit test:
import unittest
class TestShortener(unittest.TestCase):
def test_candidate(self):
text = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
for width in range(50, -1, -1):
with self.subTest(width=width):
self.assertEqual(shorten_to_bytes_width(text, width), candidate(text, width))
回答3:
I will propose a naive solution with a loop and checking len of encoded characters like len(text[index].encode())
. Also added timings for improvement proposed in this comment
import textwrap, timeit
_MIN_WIDTH = 5
def A_B_B(text: str, width: int) -> str:
width = max(_MIN_WIDTH, width) # This prevents ValueError if width < _MIN_WIDTH
text = textwrap.shorten(text, width) # After this line, len(text.encode()) >= width
while len(text.encode()) > width:
text = textwrap.shorten(text, len(text) - 1)
assert len(text.encode()) <= width
return text
def naive(text: str, width: int) -> str:
width = max(_MIN_WIDTH, width) # This prevents ValueError if width < TEXTWRAP_MIN_WIDTH
# textwrap.shorten does a lot of work like merging several spaces into one,
# so we will use it first
text = textwrap.shorten(text, width)
if len(text.encode()) <= width:
return text
current_width = _MIN_WIDTH # len of placeholder
index = 0
slice_index = 0 # we will do a slice on a last found space if necessary
# (to avoid slicing in a middle of a word, for example)
endings = ' ' # there also can be some more endings like \t \n
while True:
# we will use the fact that if str = str1 + str2 then
# len(str.encode()) = len(str1.encode()) + len(str2.encode())
new_width = current_width + len(text[index].encode()) # taking one more character
if new_width > width:
break
if text[index] in endings:
slice_index = index
index += 1
current_width = new_width
if slice_index: # slice_index = 0 is a special case
# when we dont go further than end of first word
slice_index += 1 # to include found space
text = text[:slice_index] + '[...]'
assert len(text.encode()) <= width
return text
s = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
n = 27
print(timeit.timeit(lambda: A_B_B(s, n), number=1000))
print(timeit.timeit(lambda: naive(s, n), number=1000))
Timings:
0.032570790994213894
0.0206866109801922
回答4:
Here is a solution that tries to solve this problem directly without playing trial and error with textwrap.shorten()
using different input strings.
It uses a recursive algorithm based on educated guesses about the minimum and maximum length of the string. Partial solutions (based on the guessed minimum length) are used to reduce the problem size quickly.
The solution has two parts:
bytes_to_char_length()
computes the maximum number of chars in a string that fit within a number of bytes (see below for examples of how it works).shorten_to_bytes()
which uses the result ofbytes_to_char_length()
to calculate the placeholder position.
MAX_BYTES_PER_CHAR = 4
def bytes_to_char_length(input, bytes_left, start=0, max_length=None):
if bytes_left <= 0 or (max_length is not None and max_length <= 0):
return 0
if max_length is None:
max_length = min(bytes_left, len(input) - start)
bytes_too_much = len(input[start:start + max_length].encode()) - bytes_left
if bytes_too_much <= 0:
return max_length
# Conservative estimate for the min_length assuming all chars at the end were
# only 1 Byte.
min_length = max(max_length - bytes_too_much, bytes_left // MAX_BYTES_PER_CHAR)
# Generous estimate for the new max_length assuming all chars at the end of
# max_string were MAX_BYTES_PER_CHAR sized.
max_length -= (bytes_too_much + MAX_BYTES_PER_CHAR - 1) // MAX_BYTES_PER_CHAR
# Now take `min_length` as a partial solution and call the function
# recursively to fill the remaining bytes.
new_start = start + min_length
bytes_left -= len(input[start:new_start].encode())
return min_length + bytes_to_char_length(input, bytes_left, new_start, max_length - min_length)
def shorten_to_bytes(text, byte_width, placeholder='', start=0):
if len(text[start:start + byte_width + 1].encode()) <= byte_width:
return text
byte_width_p = byte_width - len(placeholder.encode())
if byte_width_p <= 0:
p = placeholder.strip()
return p if len(p.encode()) <= byte_width else ''
max_chars = bytes_to_char_length(text, byte_width_p, start)
# Find rightmost whitespace if any
w = text.rfind(' ', start, start + max_chars + 1)
if w > 0:
return text[start:w] + placeholder
else:
return text[start:start + max_chars] + placeholder
Examples for how bytes_to_char_length()
works
For illustration purposes let’s assume each digit in the string is encoded to its value in bytes. So '1'
, '2'
, '3'
, '4'
take 1, 2, 3 and 4 Bytes respectively.
For bytes_to_char_length('11111', 3)
we’ll get:
max_length
is set to3
by default.- The
input[start:start + max_length] = '111'
which has 3 Bytes, sobytes_too_much = 0
- This is the exact size we were looking for so we’re done.
For bytes_to_char_length('441111', 10)
:
max_length
is set to6
input[start:start + max_length] = '441111'
with 12 Bytes, sobytes_too_much = 2
min_length
is set tomax_length - 2 == 4
. (It takes at maximum 2 characters to take up 2 Bytes).max_length
is reduced by 1 (It takes at least 1 character to take 2 Bytes).bytes_left = 0
,max_length = 1
- Recursive call immediately returns
0
because there are no bytes left. Result ismin_length + 0 == 4
.
For bytes_to_char_length('111144', 10)
:
max_length
is set to6
(as before)input[start:start + max_length] = '111144'
with 12 Bytes, sobytes_too_much = 2
min_length
is set tomax_length - 2 == 4
max_length
is reduced by 1.new_start = 4
,remaining_bytes = 6
,max_length = 1
- Recursive call:
4 + bytes_to_char_length('111144', 6, start=4, max_length=1)
input[start:start + max_length] = '4'
with 4 Bytes, sobytes_too_much = -2
- Immediately return from recursion by returning
max_length == 1
, return5
as result.
Formally it makes the following assumptions:
- Each character takes at least one Byte in the encoded string.
- Each character takes at least
MAX_BYTES_BY_CHAR
in the encoded string. - For two substrings if I split a string
s
into substringss == s1 + s2
, thens.encode() == s1.encode() + s2.encode()
Performance
- It should work smoothly even for long input strings, as it avoids copying the string.
- According to my timeit measurements it’s at about one order of magnitude faster for the simple test case.
来源:https://stackoverflow.com/questions/56401166/using-python-textwrap-shorten-for-string-but-with-bytes-width