I want printf to recognize multi-byte characters when calculating the field width so that columns line up properly... I can't find an answer to this problem and was wonderi
The best I can think of is:
function formatwidth
{
    local STR=$1; shift
    local WIDTH=$1; shift
    local BYTEWIDTH=$( echo -n "$STR" | wc -c )
    local CHARWIDTH=$( echo -n "$STR" | wc -m )
    echo $(( $WIDTH + $BYTEWIDTH - $CHARWIDTH ))
}
printf "## %5s %*s %5s ##\n## %5s %*s %5s ##\n" \
    '' $( formatwidth "*" 5 ) '*' '' \
    '' $( formatwidth "•" 5 ) "•" ''
You use the * width specifier to take the width as an argument, and calculate the width you need by adding the number of additional bytes in multibyte characters.
Note that in GNU wc, -c returns bytes, and -m returns (possibly multibyte) characters.
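To make the arithmetic concrete: in UTF-8, • (U+2022) is 3 bytes but 1 character, so a requested width of 5 becomes 5 + 3 - 1 = 7. As a minimal sketch, assuming bash in a UTF-8 locale, the same adjustment can also be computed without wc by measuring ${#str} once in the current locale and once under LC_ALL=C. The name adjusted_width is hypothetical and not part of the answer above.

```shell
#!/usr/bin/env bash
# adjusted_width STRING WIDTH   (hypothetical helper name)
# Prints the byte-based field width that makes printf pad STRING
# to WIDTH characters.
adjusted_width() {
    local str=$1 width=$2
    local chars=${#str}     # length in characters (current locale)
    local LC_ALL=C          # from here on, ${#str} counts bytes
    local bytes=${#str}
    echo $(( width + bytes - chars ))
}

# Both fields should come out the same visible width in a UTF-8 locale.
printf '[%*s]\n' "$(adjusted_width '*' 5)" '*'
printf '[%*s]\n' "$(adjusted_width '•' 5)" '•'
```

Because the function is called through $( ... ), the LC_ALL change happens in a subshell and does not leak into the caller.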
I will probably use GNU awk:
awk 'BEGIN{ printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n", "", "*", "", "", "•", "" }'
##           *       ##
##           •       ##
You can even write a shell wrapper function called printf on top of awk to keep the same interface:
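# Build a gawk command line from printf-style arguments.
# Arguments that contain quote characters will break the generated command.
tr2awk() {
    local FMT="$1" ARG
    echo -n "gawk 'BEGIN{ printf \"$FMT\""
    shift
    for ARG in "$@"
    do echo -n ", \"$ARG\""
    done
    echo " }'"
}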
Then override printf with a simple function:
printf() { eval "$(tr2awk "$@")"; }
Test it:
# buggy printf binary test:
/usr/bin/printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##         •       ##
# buggy printf shell builtin test:
builtin printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##         •       ##
# fixed printf function test:
printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##           •       ##
Are these the only ways? Is there no way to do it with printf alone?
Well, with the example from ninjalj (thx btw), I wrote a script to deal with this problem and saved it as fprintf in /usr/local/bin:
#! /bin/bash
IFS=' '
declare -a Text=("${@}")
## Skip the whole thing if there are no multi-byte characters ##
if (( $(echo "${Text[*]}" | wc -c) > $(echo "${Text[*]}" | wc -m) )); then
    if echo "${Text[*]}" | grep -Eq '%[#0 +-]?[0-9]+(\.[0-9]+)?[sb]'; then
        IFS=$'\n'
        declare -a FormatStrings=($(echo -n "${Text[0]}" | grep -Eo '%[^%]*?[bs]'))
        IFS=$' \t\n'
        declare -i format=0
        ## Check every format string ##
        for fw in "${FormatStrings[@]}"; do
            (( format++ ))
            if [[ "$fw" =~ ^%[#0\ +-]?[1-9][0-9]*(\.[1-9][0-9]*)?[sb]$ ]]; then
                (( Difference = $(echo "${Text[format]}" | wc -c) - $(echo "${Text[format]}" | wc -m) ))
                ## If multi-byte characters ##
                if (( Difference > 0 )); then
                    ## If a field width is entered then replace field width value ##
                    if [[ "$fw" =~ ^%[#0\ +-]?[1-9][0-9]* ]]; then
                        (( Width = $(echo -n "$fw" | gsed -re 's|^%[#0 +-]?([1-9][0-9]*).*[bs]|\1|') + Difference ))
                        declare -a Text[0]="$(echo -n "${Text[0]}" | gsed -rne '1h;1!H;${g;y|\n|\x1C|;s|(%[^%])|\n\1|g;p}' | gsed -rne $(( format + 1 ))'s|^(%[#0 +-]?)[1-9][0-9]*|\1'${Width}'|;1h;1!H;${g;s|\n||g;y|\x1C|\n|;p}')"
                    fi
                    ## If a precision is entered then replace precision value ##
                    if [[ "$fw" =~ \.[1-9][0-9]*[sb]$ ]]; then
                        (( Precision = $(echo -n "$fw" | gsed -re 's|^%.*\.([1-9][0-9]*)[sb]$|\1|') + Difference ))
                        declare -a Text[0]="$(echo -n "${Text[0]}" | gsed -rne '1h;1!H;${g;y|\n|\x1C|;s|(%[^%])|\n\1|g;p}' | gsed -rne $(( format + 1 ))'s|^(%[#0 +-]?([1-9][0-9]*)?)\.[1-9][0-9]*([bs])|\1.'${Precision}'\3|;1h;1!H;${g;s|\n||g;y|\x1C|\n|;p}')"
                    fi
                fi
            fi
        done
    fi
fi
printf "${Text[@]}"
exit 0
Usage: fprintf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' '•' ''
A few things to note:
- It does not handle * (asterisk) values for formats because I never use them. I wrote this for me and didn't want to over-complicate things.
- It only checks %s and %b, as they seem to be the only ones that are affected by this problem. Thus, if somehow someone manages to get a multi-byte Unicode character out of a number, it may not work without minor modification.
- ... printf (not some old-skooler UNIX hacker), feel free to modify, or use as is, all!
A language like Python will probably solve your problems in a simpler, more controllable way...
#!/usr/bin/env python3
# coding=utf-8
import unicodedata

def width(string):
    # Wide (W) and fullwidth (F) characters occupy two terminal columns.
    return sum(1 + (unicodedata.east_asian_width(c) in "WF")
               for c in string)

a1 = ['する', 'します', 'trazan', 'した', 'しました']
a2 = ['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']
for i, j in zip(a1, a2):
    # Pad each left-hand word to 12 display columns.
    print('%s %s: %s' % (i, ' ' * (12 - width(i)), j))
A pure shell solution
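right_justify() {
    # parameters: field_width string
    # the padded string is returned in the global variable "result"
    local spaces questions
    spaces=''
    questions=''
    # build field_width spaces and a glob pattern of field_width "?"
    while [ "${#questions}" -lt "$1" ]; do
        spaces=$spaces" "
        questions=$questions?
    done
    result=$spaces$2
    # keep only the last field_width characters
    result=${result#"${result%$questions}"}
}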
Note that this still does not work in dash because dash has no locale support.
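As a usage sketch, assuming bash (or another shell whose ${#var} and ? pattern count characters in the current locale), note that the function leaves its answer in the global variable result rather than printing it. The function is repeated here only so the snippet runs on its own. One side effect of the pattern trick is that a string longer than the field is cut down to its last field_width characters.

```shell
#!/usr/bin/env bash
right_justify() {
    # parameters: field_width string; answer is left in $result
    local spaces questions
    spaces=''
    questions=''
    while [ "${#questions}" -lt "$1" ]; do
        spaces=$spaces" "
        questions=$questions?
    done
    result=$spaces$2
    result=${result#"${result%$questions}"}
}

right_justify 5 'ab'
printf '[%s]\n' "$result"    # [   ab]

# In a UTF-8 locale the bullet counts as one character.
right_justify 5 '•'
printf '[%s]\n' "$result"

# Longer than the field: only the last 3 characters survive.
right_justify 3 'abcdef'
printf '[%s]\n' "$result"    # [def]
```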
This is kind of late, but I just came across this, and thought I would post it for others coming across the same post. A variation on @ninjalj's answer might be to create a function that returns a string of a given length rather than calculating the required format length:
#!/bin/bash
function sized_string
{
    local STR=$1 WIDTH=$2
    local BYTEWIDTH=$( echo -n "$STR" | wc -c )
    local CHARWIDTH=$( echo -n "$STR" | wc -m )
    # printf pads by bytes, so widen the field by the extra bytes
    local FMT_WIDTH=$(( WIDTH + BYTEWIDTH - CHARWIDTH ))
    printf "%*s" "$FMT_WIDTH" "$STR"
}
printf "[%s]\n" "$(sized_string "abc" 20)"
printf "[%s]\n" "$(sized_string "ab•cd" 20)"
which outputs:
[                 abc]
[               ab•cd]
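The same byte-count correction applies to left-justified columns, which is the more common layout for tables. What follows is a minimal sketch, assuming bash and a wc that supports -m; pad_right is a hypothetical name. Like the functions above it counts characters, not display columns, so double-width CJK characters would still misalign.

```shell
#!/usr/bin/env bash
# pad_right STRING WIDTH   (hypothetical helper name)
# Prints STRING left-justified in a field of WIDTH characters.
pad_right() {
    local str=$1 width=$2
    local bytes chars
    bytes=$(printf '%s' "$str" | wc -c)
    chars=$(printf '%s' "$str" | wc -m)
    # %-*s left-justifies; widen by the extra bytes printf would miscount
    printf '%-*s' "$(( width + bytes - chars ))" "$str"
}

# Print a small two-column table from colon-separated input.
while IFS=: read -r name note; do
    printf '| %s | %s |\n' "$(pad_right "$name" 8)" "$(pad_right "$note" 10)"
done <<'EOF'
abc:plain
ab•cd:bullet
naïve:accent
EOF
```

Command substitution strips only trailing newlines, so the padding spaces survive the $( ... ) round trip.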