How can I handle these weird special characters messing up my print formatting?

Question

I am printing a formatted table, but some user-generated characters take up more than one character cell on screen, which breaks the alignment, as you can see in the screenshot below.

The "title" column is formatted to be 68 characters wide. These "special characters" each occupy more than one character cell when rendered but are counted as a single character, which pushes the column past its bounds.

print('{0:16s}{3:<18s}{1:68s}{2:>8n}'.format((
    ' ' + streamer['user_name'][:12] + '..') if len(streamer['user_name']) > 12 else ' ' + streamer['user_name'],
    (streamer['title'].strip()[:62] + '..') if len(streamer['title']) > 62 else streamer['title'].strip(),
    streamer['viewer_count'],
    (gamesDic[streamer['game_id']][:15] + '..') if len(gamesDic[streamer['game_id']]) > 15 else gamesDic[streamer['game_id']]))

Any advice on how to deal with these special characters?

edit: I printed the offending string to a file:

🔴 𝐀𝐒𝐌𝐑 (𝙪𝙥 𝙘𝙡𝙤𝙨𝙚) ✨ ＬＩＶＥ 🔔 SUBS GET SNAPCHAT

edit2:

Why do these not align on a character boundary?

edit3:

Today the first two characters are producing weird output. But the columns are aligned in each case below.

First character in isolation: title[0]

Second character in isolation: title[1]

First and second character together: title[0] + title[1]

Answer 1:

I've written a custom string formatter based on @snakecharmerb's comment, but the "half character width" problem still persists:

import unicodedata

def fstring(string, max_length, align='l'):
    string = str(string)
    extra_length = 0
    for char in string:
        if unicodedata.east_asian_width(char) == 'F':
            extra_length += 1

    diff = max_length - len(string) - extra_length
    if diff > 0:
        return string + diff * ' ' if align == 'l' else diff * ' ' + string
    elif diff < 0:
        return string[:max_length-3] + '.. '

    return string

data = [{'user_name': 'shroud', 'game_id': 'Apex Legends', 'title': 'pathfinder twitch prime loot YAYA @shroud on socials for update', 'viewer_count': 66200},
        {'user_name': 'Amouranth', 'game_id': 'ASMR', 'title': '🔴 𝐀𝐒𝐌𝐑 (𝙪𝙥 𝙘𝙡𝙤𝙨𝙚) ✨ ＬＩＶＥ 🔔 SUBS GET SNAPCHAT', 'viewer_count': 2261}]

for d in data:
    name = fstring(d['user_name'], 20)
    game_id = fstring(d['game_id'], 15)
    title = fstring(d['title'], 62)
    count = fstring(d['viewer_count'], 10, align='r')
    print('{}{}{}{}'.format(name, game_id, title, count))

It produces this output:

(I can't post it as text, since the formatting would be lost)
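As a hedged sketch (not the formatter above, and not a fix for the half-width rendering issue), the same idea can be extended so that both Fullwidth ("F") and Wide ("W") characters count as two columns, and so that truncation is also done by estimated width rather than by len(). The names display_width and fit are hypothetical, and the two-columns-per-wide-character rule is an assumption that a given terminal and font may not honour:

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal columns `text` occupies (an approximation)."""
    width = 0
    for char in text:
        if unicodedata.combining(char):
            continue  # combining marks are drawn over the previous character
        # Assumption: Fullwidth ('F') and Wide ('W') characters take two columns.
        width += 2 if unicodedata.east_asian_width(char) in ('F', 'W') else 1
    return width


def fit(text, columns, align='l'):
    """Truncate or pad `text` to `columns` estimated columns (columns >= 2)."""
    text = str(text)
    if display_width(text) > columns:
        kept = ''
        # Keep characters while there is still room for the '..' marker.
        for char in text:
            if display_width(kept + char) > columns - 2:
                break
            kept += char
        text = kept + '..'
    padding = ' ' * (columns - display_width(text))
    return text + padding if align == 'l' else padding + text
```

For example, fit(d['title'], 62) could replace fstring(d['title'], 62) in the loop above. Whether the columns then line up still depends on how the terminal actually draws each glyph.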

Answer 2:

I made this comment to the question:

The characters in "ＬＩＶＥ" are Fullwidth characters. A hacky way to deal with them might be to test their width with unicodedata.east_asian_width(char) (it will return "F" for fullwidth characters) and substitute with the final character of unicodedata.name(char) (or just count them as length 2)
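A related option, swapped in here in place of the unicodedata.name() substitution suggested in the comment, is Unicode compatibility normalization. The sketch below assumes it is acceptable to change the text that gets displayed: NFKC folds fullwidth forms and the mathematical bold/italic letters into their plain equivalents, so "ＬＩＶＥ" becomes "LIVE". Emoji are not affected, so their width still has to be handled separately.

```python
import unicodedata

title = '🔴 𝐀𝐒𝐌𝐑 (𝙪𝙥 𝙘𝙡𝙤𝙨𝙚) ✨ ＬＩＶＥ 🔔 SUBS GET SNAPCHAT'

# NFKC replaces compatibility characters (fullwidth forms, mathematical
# alphanumerics) with their plain equivalents; emoji are left untouched.
plain = unicodedata.normalize('NFKC', title)
print(plain)
```

After normalization, len(plain) matches the rendered width for everything except the emoji and any other wide characters that remain.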

This "answer" is essentially another comment, but too long for the comment field.

This hack, as implemented in Alderven's answer, almost works for the OP, but the example string is rendered with an extra half a character width (note that the example string does not contain any East Asian halfwidth characters).

I am unable to reproduce this exact behaviour with the following test statement, where s is the example string from the question and ud is the unicodedata module, varying which characters are removed:

import unicodedata as ud
print((s + (68 - (len(s) + sum(1 for x in s if ud.east_asian_width(x) in ('F', 'N', 'W')))) * 'x') + '\n' + ('x' * 68))

In a Python 3.6 interpreter in a GNOME terminal on Debian, using the default monospace regular font, removing full-width characters made the example string appear to render three characters longer than the equivalent string of "x"s.

Removing full-width and wide (East Asian Width "W") characters produced a string that appeared to render the same length as the equivalent number of "x"s.

In Python 3.7 in a KDE Konsole terminal on openSUSE, using the Ubuntu Monospace regular font, I could not produce a string that rendered at the same length, regardless of which combination of full-width, wide or neutral ("N") characters I removed.

I did notice that the sparkles character (✨) seemed to take up an extra half a width when rendered alone in Konsole, but I could not see any half-width differences when testing the full string.

I suspect that the problem is low-level rendering outside the control of Python, as this note in the Unicode standard suggests:

Note: The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations. The growing repertoire of the Unicode Standard has long exceeded the bounds of East Asian legacy character encodings, and terminal emulations often need to be customized to support edge cases and for changes in typographical behavior over time.

Source: https://stackoverflow.com/questions/54821769/how-can-i-handle-these-weird-special-characters-messing-my-print-formatting

Tags: python, unicode, terminal, string-formatting, python-unicode