Find characters that are similar glyphically in Unicode?

Question

Let's say I have the characters Ú, Ù, and Ü. All of them are glyphically similar to the English U.

Is there some list or algorithm to do this:

Given a Ú, Ù, or Ü, return the English U
Given an English U, return the list of all U-similar characters

I'm not sure whether the code point of a Unicode character is the same across all fonts. If it is, I suppose there could be an easy and efficient way to do this?

UPDATE

If you're using Ruby, the unicode-confusable gem may help in some cases.

Answer 1:

This won't work for all conditions, but one way to get rid of most accents is to convert the characters to their decomposed form, then throw away the combining accents:

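# Python 3
import unicodedata as ud
s = 'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
# Decompose to NFD, then drop every non-ASCII code point (the combining accents included).
print(ud.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii'))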

Output

U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U
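
Note that the snippet above discards every character that is not ASCII to begin with, such as Greek or Cyrillic letters. A minimal Python 3 sketch that removes only the combining marks, assuming that is the behavior wanted (`strip_accents` is a name chosen for this illustration):

```python
import unicodedata

def strip_accents(text):
    """Decompose to NFD and drop combining marks, keeping all other characters."""
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns the canonical combining class, which is
    # non-zero for accents such as U+0301 COMBINING ACUTE ACCENT.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('\u00da \u00d9 \u00dc'))   # Ú Ù Ü -> U U U
print(strip_accents('\u0395\u03bb\u03bb\u03ac\u03b4\u03b1'))  # Greek text keeps its letters
```

Characters with no canonical decomposition (for example ð or ø) pass through unchanged with this approach.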

To find accent characters, use something like:

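# Python 3
import unicodedata as ud
import string

def asc(ch):
    # Strip accents: decompose, then discard everything outside ASCII.
    return ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')

# Every code point in the Basic Multilingual Plane.
U = ''.join(chr(i) for i in range(65536))
for c in string.ascii_letters:
    print(''.join(u for u in U if asc(u) == c))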

Output

aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
bḃḅḇ
cçćĉċčḉ
dďḋḍḏḑḓ
eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
fḟ
 :
etc.
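
The loop above rescans all 65,536 code points once per letter. A sketch of the same idea in Python 3 that builds the reverse index in a single pass (`build_index` and its `limit` parameter are names invented for this example, not part of any library):

```python
import unicodedata
from collections import defaultdict

def build_index(limit=0x10000):
    """Map each ASCII letter to the code points below `limit` that reduce to it."""
    index = defaultdict(list)
    for cp in range(limit):
        ch = chr(cp)
        base = unicodedata.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')
        # Keep only characters that reduce to exactly one ASCII letter.
        if len(base) == 1 and base.isalpha():
            index[base].append(ch)
    return index

index = build_index()
print(''.join(index['U']))  # 'U' itself followed by its accented variants
```

Like the original, this only finds characters related by canonical decomposition, so look-alikes from other scripts are not included.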

Answer 2:

It is very unclear what you are asking to do here.

There are characters whose canonical decompositions all start with the same base character: e, é, ê, ë, ē, ĕ, ė, ę, ě, ȅ, ȇ, ȩ, ḕ, ḗ, ḙ, ḛ, ḝ, ẹ, ẻ, ẽ, ế, ề, ể, ễ, ệ, e̳, … or s, ś, ŝ, ş, š, ș, ṡ, ṣ, ṥ, ṧ, ṩ, ….
There are characters whose compatibility decompositions all include a particular character: ᵉ, ₑ, ℯ, ⅇ, ⒠, ⓔ, ㋍, ㋎, ｅ, … or s, ſ, ˢ, ẛ, ₨, ℁, ⒮, ⓢ, ㎧, ㎨, ㎮, ㎯, ㎰, ㎱, ㎲, ㎳, ㏛, ﬅ, ﬆ, ｓ, … or R, ᴿ, ₨, ℛ, ℜ, ℝ, Ⓡ, ㏚, Ｒ, ….
There are characters that just happen to look alike in some fonts: ß and β and ϐ, or 3 and Ʒ and Ȝ and ȝ and ʒ and ӡ and ᴣ, or ɣ and ɤ and γ, or F and Ϝ and ϝ, or B and Β and В, or ∅ and ○ and 0 and O and ০ and ੦ and ౦ and ૦, or 1 and l and I and Ⅰ and ᛁ and | and ǀ and ∣, ….
Characters that are the same case-insensitively, like s and S and ſ, or ss and Ss and SS and ß and ẞ, ….
Characters that all have the same numeric value, like all these for the value 1: 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១៱᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁⅟ ① ⑴ ⒈ ⓵ ❶➀➊꘡꣑꤁꧑꩑꯱𐄇𐅂𐅘𐅙𐅚𐌠𐏑𐒡𐡘𐤖𐩀𐩽𐭘𐭸𐹠𒐕𒐞𒐬𒐴𒑏𒑘𝍠𝟏𝟙𝟣𝟭𝟷 🄂 Ⅰⅰꛦ㆒㈠㊀𑁒𑁧.
Characters that all have the same primary collation strength, like all these that are the same as d: DdÐðĎďĐđ◌ͩᴰᵈᶞ◌ᷘ◌ᷙḊḋḌḍḎḏḐḑḒḓⅅⅆⅮⅾ Ⓓ ⓓ ꝹꝺＤｄ𝐃𝐝𝐷𝑑𝑫𝒅𝒟𝒹𝓓𝓭𝔇𝔡𝔻𝕕𝕯𝖉𝖣𝖽𝗗𝗱𝘋𝘥𝘿𝙙𝙳𝚍 🄳 🅓 🅳 🇩 . Note that some of those are not accessible through any kind of decomposition, but only through the DUCET/UCA values; for example, the fairly common ð or the newish ꝺ can be equated to d only through a primary UCA strength comparison; same with ƶ and z, ȼ and c, etc.
Characters that are the same in certain locales, like æ and ae, or ä and ae, or ä and aa, or MacKinley and McKinley, …. Note that locale can make a really big difference, since in some locales c and ç are the same character while in others they are not; similarly for n and ñ, or a and á and ã, ….
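
Some of these categories can be probed with Python 3's `unicodedata` module and `str.casefold()`. A short sketch covering three of them (compatibility decomposition, case folding, numeric value); the variable names are made up for the example, and collation strength, font look-alikes, and locale rules need data that `unicodedata` does not expose, so they are left out:

```python
import unicodedata

# Compatibility decomposition: NFKD maps these variants to a plain letter.
variants = {ch: unicodedata.normalize('NFKD', ch)
            for ch in '\u1d49\u2091\uff45\u24d4'}   # ᵉ ₑ ｅ ⓔ
print(variants)

# Case-insensitive equality: casefold() applies full Unicode case folding.
sharp_s_matches = '\u00df'.casefold() == 'SS'.casefold()   # ß vs SS
long_s_matches = '\u017f'.casefold() == 'S'.casefold()     # ſ vs S
print(sharp_s_matches, long_s_matches)  # True True

# Numeric value: numeric() reads the character's Numeric_Value property.
values = [unicodedata.numeric(ch) for ch in '1\u00b9\u2460\u2160\u0967']  # 1 ¹ ① Ⅰ १
print(values)
```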

Some of these can be handled. Some cannot. All require different approaches depending on different needs.

What is your real goal?

Answer 3:

Why not just compare glyphs with something like this?

package similarglyphcharacterdetector;

import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.font.FontRenderContext;
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SimilarGlyphCharacterDetector {

    static char[] TEST_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890".toCharArray();
    static BufferedImage[] SAMPLES = null;

    public static BufferedImage drawGlyph(Font font, String string) {
        FontRenderContext frc = ((Graphics2D) new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY).getGraphics()).getFontRenderContext();

        Rectangle r= font.getMaxCharBounds(frc).getBounds();

        BufferedImage res = new BufferedImage(r.width, r.height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = (Graphics2D) res.getGraphics();
        g.setColor(Color.WHITE); // fillRect paints with the current color, not the background color
        g.fillRect(0, 0, r.width, r.height);
        g.setPaint(Color.BLACK);
        g.setFont(font);
        g.drawString(string, 0, r.height - font.getLineMetrics(string, g.getFontRenderContext()).getDescent());
        return res;
    }

    private static void drawSamples(Font f) {
        SAMPLES = new BufferedImage[TEST_CHARS.length];
        for (int i = 0; i < TEST_CHARS.length; i++)
            SAMPLES[i] = drawGlyph(f, String.valueOf(TEST_CHARS[i]));
    }

    private static int compareImages(BufferedImage img1, BufferedImage img2) {
        if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight())
            throw new IllegalArgumentException();
        int d = 0;
        for (int y = 0; y < img1.getHeight(); y++) {
            for (int x = 0; x < img1.getWidth(); x++) {
                if (img1.getRGB(x, y) != img2.getRGB(x, y))
                    d++;
            }
        }
        return d;
    }

    private static int nearestSampleIndex(BufferedImage image, int maxDistance) {
        int best = Integer.MAX_VALUE;
        int bestIdx = -1;
        for (int i = 0; i < SAMPLES.length; i++) {
            int diff = compareImages(image, SAMPLES[i]);
            if (diff < best) {
                best = diff;
                bestIdx = i;
            }
        }
        if (best > maxDistance)
            return -1;
        return bestIdx;
    }

    public static void main(String[] args) throws Exception {
        Font f = new Font("FreeMono", Font.PLAIN, 13);
        drawSamples(f);
        HashMap<Character, StringBuilder> res = new LinkedHashMap<Character, StringBuilder>();
        for (char c : TEST_CHARS)
            res.put(c, new StringBuilder(String.valueOf(c)));
        int maxDistance = 5;
        for (int i = 0x80; i <= 0xFFFF; i++) {
            char c = (char)i;
            if (f.canDisplay(c)) {
                int n = nearestSampleIndex(drawGlyph(f, String.valueOf(c)), maxDistance);
                if (n != -1) {
                    char nc = TEST_CHARS[n];
                    res.get(nc).append(c);
                }
            }
        }
        for (Map.Entry<Character, StringBuilder> entry : res.entrySet())
            if (entry.getValue().length() > 1)
                System.out.println(entry.getValue());
    }
}

Output:

AÀÁÂÃÄÅĀĂĄǍǞȀȦΆΑΛАѦӒẠẢἈἉᾸᾹᾺᾼ₳Å
BƁƂΒБВЬḂḄḆ
CĆĈĊČƇΓЄГСὉℂⅭ
...

Source: https://stackoverflow.com/questions/4846365/find-characters-that-are-similar-glyphically-in-unicode

Tags

unicode

glyph