How do I extract the list of supported Unicode characters from a TrueType or embedded OpenType font on Linux?
Is there a tool or a library I can use to process a .ttf file?
Here is a method using the FontTools module (which you can install with something like pip install fonttools):
#!/usr/bin/env python
from itertools import chain
import sys
from fontTools.ttLib import TTFont
from fontTools.unicode import Unicode
ttf = TTFont(sys.argv[1], 0, ignoreDecompileErrors=True, fontNumber=-1)

chars = chain.from_iterable(
    [y + (Unicode[y[0]],) for y in x.cmap.items()] for x in ttf["cmap"].tables
)
print(list(chars))
# Use this for just checking if the font contains the codepoint given as
# second argument:
#char = int(sys.argv[2], 0)
#print(Unicode[char])
#print(char in (x[0] for x in chars))
ttf.close()
The script takes the font path as an argument:
python checkfont.py /path/to/font.ttf
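If you only need the membership test from the commented-out block, here is a minimal standalone sketch (the script name and the example arguments are hypothetical); it relies on the cmap table's getBestCmap() helper, which merges the preferred Unicode subtables into a single dict keyed by code point.
#!/usr/bin/env python
import sys

from fontTools.ttLib import TTFont

# Hypothetical usage: python hascodepoint.py /path/to/font.ttf 0x20AC
font_path, codepoint = sys.argv[1], int(sys.argv[2], 0)

with TTFont(font_path, fontNumber=-1, ignoreDecompileErrors=True) as ttf:
    # getBestCmap() returns {code point: glyph name} for the best Unicode cmap.
    print(hex(codepoint), codepoint in ttf["cmap"].getBestCmap())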
The answer by Janus above (https://stackoverflow.com/a/19438403/431528) works, but Python is too slow, especially for Asian fonts: it takes minutes for a 40 MB font on my E5 machine.
So I wrote a small C++ program to do this. It depends on FreeType2 (https://www.freetype.org/). It is a VS2015 project, but since it is a console application it is easy to port to Linux.
The code can be found here: https://github.com/zhk/AllCodePoints. For the same 40 MB Asian font it takes about 30 ms on my E5 machine.
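For reference, the same first/next-char iteration that FreeType2 exposes can be sketched in Python via the freetype-py binding; this is a minimal sketch, not the linked C++ code, and the command-line argument handling is just a placeholder.
import sys

import freetype  # pip install freetype-py

# Walk the font's active charmap with FreeType's FT_Get_First_Char /
# FT_Get_Next_Char iteration, collecting every mapped code point.
face = freetype.Face(sys.argv[1])
codepoints = []
charcode, glyph_index = face.get_first_char()
while glyph_index:
    codepoints.append(charcode)
    charcode, glyph_index = face.get_next_char(charcode, glyph_index)
print(len(codepoints), "code points")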
You can do this on Linux in Perl using the Font::TTF module.
If you want to get all characters supported by a font, you may use the following (based on Janus's answer):
from fontTools.ttLib import TTFont

def get_font_characters(font_path):
    with TTFont(font_path) as font:
        characters = {chr(y[0]) for x in font["cmap"].tables for y in x.cmap.items()}
    return characters
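A quick usage sketch (the font path here is just a placeholder):
chars = get_font_characters("/path/to/font.ttf")
print(len(chars), "characters")
print("€" in chars)  # True if the font maps U+20AC EURO SIGN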
Here is a POSIX[1] shell script that prints the code point and the character in a nice and easy way, with the help of fc-match, which is mentioned in Neil Mayhew's answer (it can even handle Unicode code points up to 8 hex digits):
#!/bin/sh
for range in $(fc-match --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        n_hex=$(printf "%04x" "$n")
        # use \U so code points with more than 4 hex digits also work
        printf "%-5s\U$n_hex\t" "$n_hex"
        count=$((count + 1))
        if [ $((count % 10)) = 0 ]; then
            printf "\n"
        fi
    done
done
printf "\n"
You can pass the font name or anything that fc-match accepts:
$ ls-chars "DejaVu Sans"
Updated content:
I learned that command substitution is very time-consuming (the printf subshell inside the loop of my script), so I managed to write an improved version that is 5-10 times faster:
#!/bin/sh
for range in $(fc-match --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        printf "%04x\n" "$n"
    done
done | while read -r n_hex; do
    count=$((count + 1))
    printf "%-5s\U$n_hex\t" "$n_hex"
    [ $((count % 10)) = 0 ] && printf "\n"
done
printf "\n"
Old version:
$ time ls-chars "DejaVu Sans" | wc
592 11269 52740
real 0m2.876s
user 0m2.203s
sys 0m0.888s
New version (the line count indicates 5910+ characters, printed in 0.4 seconds!):
$ time ls-chars "DejaVu Sans" | wc
592 11269 52740
real 0m0.399s
user 0m0.446s
sys 0m0.120s
End of update
(Sample output omitted; it aligns better in my st terminal.)
fc-query my-font.ttf
will give you a map of the supported glyphs and all the locales the font is appropriate for, according to fontconfig.
Since pretty much all modern Linux apps are fontconfig-based, this is much more useful than a raw Unicode list.
The actual output format is discussed here: http://lists.freedesktop.org/archives/fontconfig/2013-September/004915.html
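If you still want a raw code point list from fontconfig's point of view, here is a small sketch (assuming fc-query is available and supports the same %{charset} format string used with fc-match above) that expands the reported ranges in Python:
import subprocess
import sys

# Ask fontconfig for the font's charset, e.g. "20-7e a0-ff 131 152-153 ...".
out = subprocess.run(
    ["fc-query", "--format=%{charset}", sys.argv[1]],
    capture_output=True, text=True, check=True,
).stdout

codepoints = []
for token in out.split():
    start, _, end = token.partition("-")
    codepoints.extend(range(int(start, 16), int(end or start, 16) + 1))

print(len(codepoints), "code points according to fontconfig")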