Is there a list of characters that look similar to English letters?

Deadly 提交于 2019-11-30 10:29:35

问题


I’m having a crack at profanity filtering for a web forum written in Python.

As part of that, I’m attempting to write a function that takes a word, and returns all possible mock spellings of that word that use visually similar characters in place of specific letters (e.g. s†å©køv€rƒ|øw).

I expect I’ll have to expand this list over time to cover people’s creativity, but is there a list floating around anywhere on the internet that I could use as a starting point?


回答1:


This is probably both vastly more deep than you need, yet not wide enough to cover your use case, but the Unicode consortium have had to deal with attacks against internationalised domain names and came up with this list of homographs (characters with the same or similar rendering):

http://www.unicode.org/Public/security/latest/confusables.txt

Might make a starting point at least.




回答2:


http://en.wikipedia.org/wiki/Letterlike_Symbols

It's much much much less comprehensive but is more comprehensible.




回答3:


I created a python class to do exactly this, based on Robin's unicode link for "confusables"

https://github.com/wanderingstan/Confusables

For example, "Hello" would get expanded into the following set of regexp character classes:

[H\H\ℋ\ℌ\ℍ\𝐇\𝐻\𝑯\𝓗\𝕳\𝖧\𝗛\𝘏\𝙃\𝙷\Η\𝚮\𝛨\𝜢\𝝜\𝞖\Ⲏ\Н\Ꮋ\ᕼ\ꓧ\𐋏\Ⱨ\Ң\Ħ\Ӊ\Ӈ] [e\℮\e\ℯ\ⅇ\𝐞\𝑒\𝒆\𝓮\𝔢\𝕖\𝖊\𝖾\𝗲\𝘦\𝙚\𝚎\ꬲ\е\ҽ\ɇ\ҿ] [l\‎\|\∣\⏽\│1\‎\۱\𐌠\‎\𝟏\𝟙\𝟣\𝟭\𝟷I\I\Ⅰ\ℐ\ℑ\𝐈\𝐼\𝑰\𝓘\𝕀\𝕴\𝖨\𝗜\𝘐\𝙄\𝙸\Ɩ\l\ⅼ\ℓ\𝐥\𝑙\𝒍\𝓁\𝓵\𝔩\𝕝\𝖑\𝗅\𝗹\𝘭\𝙡\𝚕\ǀ\Ι\𝚰\𝛪\𝜤\𝝞\𝞘\Ⲓ\І\Ӏ\‎\‎\‎\‎\‎\‎\‎\‎\ⵏ\ᛁ\ꓲ\𖼨\𐊊\𐌉\‎\‎\ł\ɭ\Ɨ\ƚ\ɫ\‎\‎\‎\‎\ŀ\Ŀ\ᒷ\🄂\⒈\‎\⒓\㏫\㋋\㍤\⒔\㏬\㍥\⒕\㏭\㍦\⒖\㏮\㍧\⒗\㏯\㍨\⒘\㏰\㍩\⒙\㏱\㍪\⒚\㏲\㍫\lj\IJ\‖\∥\Ⅱ\ǁ\‎\𐆙\⒒\Ⅲ\𐆘\㏪\㋊\㍣\Ю\⒑\㏩\㋉\㍢\ʪ\₶\Ⅳ\Ⅸ\ɮ\ʫ\㏠\㋀\㍙] [l\‎\|\∣\⏽\│1\‎\۱\𐌠\‎\𝟏\𝟙\𝟣\𝟭\𝟷I\I\Ⅰ\ℐ\ℑ\𝐈\𝐼\𝑰\𝓘\𝕀\𝕴\𝖨\𝗜\𝘐\𝙄\𝙸\Ɩ\l\ⅼ\ℓ\𝐥\𝑙\𝒍\𝓁\𝓵\𝔩\𝕝\𝖑\𝗅\𝗹\𝘭\𝙡\𝚕\ǀ\Ι\𝚰\𝛪\𝜤\𝝞\𝞘\Ⲓ\І\Ӏ\‎\‎\‎\‎\‎\‎\‎\‎\ⵏ\ᛁ\ꓲ\𖼨\𐊊\𐌉\‎\‎\ł\ɭ\Ɨ\ƚ\ɫ\‎\‎\‎\‎\ŀ\Ŀ\ᒷ\🄂\⒈\‎\⒓\㏫\㋋\㍤\⒔\㏬\㍥\⒕\㏭\㍦\⒖\㏮\㍧\⒗\㏯\㍨\⒘\㏰\㍩\⒙\㏱\㍪\⒚\㏲\㍫\lj\IJ\‖\∥\Ⅱ\ǁ\‎\𐆙\⒒\Ⅲ\𐆘\㏪\㋊\㍣\Ю\⒑\㏩\㋉\㍢\ʪ\₶\Ⅳ\Ⅸ\ɮ\ʫ\㏠\㋀\㍙] [o\ం\ಂ\ം\ං\०\੦\૦\௦\౦\೦\൦\๐\໐\၀\‎\۵\o\ℴ\𝐨\𝑜\𝒐\𝓸\𝔬\𝕠\𝖔\𝗈\𝗼\𝘰\𝙤\𝚘\ᴏ\ᴑ\ꬽ\ο\𝛐\𝜊\𝝄\𝝾\𝞸\σ\𝛔\𝜎\𝝈\𝞂\𝞼\ⲟ\о\ჿ\օ\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\ഠ\ဝ\𐓪\𑣈\𑣗\𐐬\‎\ø\ꬾ\ɵ\ꝋ\ө\ѳ\ꮎ\ꮻ\ꭴ\‎\ơ\œ\ɶ\∞\ꝏ\ꚙ\ൟ\တ]

This regexp will match against "𝓗℮𝐥1೦"




回答4:


I don't have solution per se, but I have some ideas.

@collapsar's approach in the comments sounds good to me in principle, but I think you'd want to use an off-the-shelf OCR library rather than try to analyze the images yourself. To make the images, I'd use a font like something in the DejaVu family, because it has good coverage of relatively obscure Unicode characters.

Another easy way to get data is to look at the decompositions of "precomposed" characters like "à"; if a character can be decomposed into one or more combining chapters followed by a base character that looks like an English letter, it probably looks like an English letter itself.

Nothing beats lots of data for a problem like this. You could collect a lot of good examples of character substitutions people have made by scraping the right web forums. Then you can use this procedure to learn new ones: first, find "words" containing mostly characters you can identify, along with some you can't. Make a regex from the word, converting everything you can to regular letters and replacing everything else with ".". Then match your regex against a dictionary, and if you get only one match, you have some very good candidates for what the unknown characters are supposed to represent. (I would not actually use a regex for searching a dictionary, but you get the idea.)

Instead of mining forums, you may be able to use Google's n-gram corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) instead, but I'm not able to check right now if it contains the kind of pseudo-words you need.



来源:https://stackoverflow.com/questions/9491890/is-there-a-list-of-characters-that-look-similar-to-english-letters

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!