Unicode formatting

爷,独闯天下 提交于 2019-12-07 12:58:25

问题


I am working with string formatting. For english the formatting is neat but for unicode characters the formatting is haphazard. Can anyone please tell me the reason? Example:

form = u'{:<15}{:<3}({})'
a = [
 u'സി ട്രീമിം',
 u'ബി ഡോഗേറ്റ്',
 u'ജെ ഹോളണ്ട്',
 u'എം നസീർ ',
 u'എം ബസ്ചാഗൻ…',
 u'ടി ഹെഡ് ',
 u'കെ ഭാരത് ',
 u'എം സിറാജ് ',
 u'എ ഈശ്വരൻ ',
 u'സി ഹാൻഡ്‌സ്‌കോംബ് ബി',]

 for i in range(0, 10):
     print form.format(a[i][:12], 1, 2)

Gives output as

While

s = [
 u'abcdef',
 u'akash',
 u'rohit',
 u'anubhav',
 u'bhargav',
 u'achut',
 u'punnet',
 u'tom',
 u'rach',
 u'kamal'
 ]
for i in range(0, 10):
     print form.format(s[i][:12], 1, 2)

Gives:


回答1:


You are printing Malayalam Unicode codepoints, which uses a lot of vowel signs to modify the preceding glyph. These vowel sign codepoints that do not themselves form a new letter, and Malayalam doesn't produce the same regular width of output in a terminal as ASCII letters would.

For example, in your first string starts with U+0D38 MALAYALAM LETTER SA and U+0D3F MALAYALAM VOWEL SIGN I. The first, letter SA, takes a full position on the screen, but the second character, the vowel sign I, when preceding by SA, alters how the letter is printed. Note how with 2 codepoints printed, there is just one visible glyph:

>>> print u'\u0d38'  # letter SA
സ
>>> print u'\u0d3f'  # vowel sign I
 ി
>>> print u'\u0d38\u0d3f'  # both together
സി

The widths of Malayalam codepoints is also different; if you add ASCII letters below SA and vowel sign I, separately and combined, it looks like this:

>>> print u'\u0d38\nA..\n\u0d3f\nB..\n\u0d38\u0d3f\nAB.'  # with ASCII letters for size
സ
A..
 ി
B..
സി
AB.

Note how is wider than A (about 2.5 times as wide), while സി is almost as wide as 3 ASCII codepoints in fixed width! Not all Malayalam letters are this wide, however. The next letter in the first example is U+0D1F MALAYALAM LETTER TTA, which is much less wide:

>>> print u'\u0d38\nA..\n\u0d1f\nB..'
സ
A..
ട
B..

In practice, I'm hoping that the difference doesn't matter and codepoints are instead combined such that the output ends up roughly the same width.

Next, Malayalam has other combining characters too; your first string has U+0D4D MALAYALAM SIGN VIRAMA, which has been combined with the preceding letter TTA.

Diacritical marks, when combined with the preceding letter, play havoc with printing width:

>>> print u'\u0d1f\nA..\n\u0d4d\nB..\n\u0d1f\u0d4d\nAB.'
ട
A..
 ്
B..
ട്
AB.

The letter TTA is just as wide as an ASCII letter, and when you add the virama sign, the width didn't actually change.

You can approximate sizes by looking at the codepoint Unicode general categories. The unicodedata.category() function gives you the category as a string:

>>> import unicodedata
>>> unicodedata.category(u'\u0d38')
'Lo'
>>> unicodedata.category(u'\u0d3f')
'Mc'
>>> unicodedata.category(u'\u0d4d')
'Mn'

The letter SA is Lo (Letter, other), the vowel sign is Mc (Mark, spacing combining), and the virama sign is Mn (Mark, nonspacing).

>>> categories = {}
>>> for c in a[0]:
...     cat = unicodedata.category(c)
...     categories[cat] = categories.get(cat, 0) + 1
... 
>>> categories
{'Lo': 4, 'Mn': 1, 'Mc': 4, 'Zs': 1}

So for the first string, there are 4 letters, 4 combining marks, and the one vowel sign. The Zs category (Separator, space) is for the ' ' ASCII space character.

Can we get their widths predicted better if we skipped Mc and Mn characters? String a[0] would be 5 characters wide (4 times Lo and 1 space):

>>> print a[0] + '\nABCDE.'
സി ട്രീമിം
ABCDE.

In the browser, that doesn't look close enough, but in my iTerm terminal window it looks like this:

To get your lines to line up, you'd have to calculate the right width for your strings to add extra spaces for the difference in display width and the number of codepoints:

import unicodedata

def malayalam_width(s):
    return sum(1 for c in s if unicodedata.category(c)[0] != 'M')

form = u'{:<{width}}{:<3}({})'
for line in a:
    line = line[:12]
    adjust = len(line) - malayalam_width(line)
    print form.format(line, 1, 2, width=15 + adjust)

This improves the output a lot already:

It appears those wider letters do make a difference after all. You'd have to manually add further width for those to get a better result; with a mapping from letter to adjusted width you could get this to align a little better again. However, the codepoint widths are set by the font you use, and I'm not sure how easy it is to find a font that uses equal width for all Malayalam letters.

I find it much easier to just use tab stops, using

form = u'{:<{width}}\t{:<3}({})'
for line in a:
    line = line[:12]
    adjust = len(line) - malayalam_width(line)
    print form.format(line, 1, 2, width=12 + adjust)

Now the numbers do line up:

You do need to keep adjusting for widths; otherwise you end up at the wrong tab stop half the time.

Caveat: I'm not at all familiar with the Malayalam script, and I'm sure to have missed subtleties about how the various letters, vowel signs and diacritical marks interact. Someone who is more familiar with the script and Unicode codepoints is probably going to be able to produce a better width approximation function than I presented here.

I've also ignored the 2 U+200C ZERO WIDTH NON-JOINER codepoints that are currently present in your last string; you may want to remove those from your data. As it's name suggests, it has no width either.




回答2:


You could use the wcwidth module, it overcomes issues where tab length is interpreted differently in various terminals (as far as I know).

I used Python 3 here, I take it you're using 2, so your mileage may vary. Also, I modified the formatting of your output to demonstrate some of the variables in use

Solution

from wcwidth import wcswidth

a = [
    u'സി ട്രീമിം',
    u'ബി ഡോഗേറ്റ്',
    u'ജെ ഹോളണ്ട്',
    u'എം നസീർ ',
    u'എം ബസ്ചാഗൻ…',
    u'ടി ഹെഡ് ',
    u'കെ ഭാരത് ',
    u'എം സിറാജ് ',
    u'എ ഈശ്വരൻ ',
    u'സി ഹാൻഡ്‌സ്‌കോംബ് ബി'
]

desired = 15
max_str = 12

for item in a:

    sub_str = item[:max_str]

    diff = len(sub_str) - wcswidth(sub_str)

    indent = desired + diff if desired - wcswidth(sub_str) > 0 else desired + diff - 1

    form = u'{:<'+ str(indent) +'} {:<3}{:<3}{:<3}'

    print (form.format(sub_str, len(sub_str), wcswidth(sub_str), indent))

Result:



来源:https://stackoverflow.com/questions/52931146/unicode-formatting

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!