Bare-minimum text sanitation

杀马特。学长 韩版系。学妹 提交于 2020-01-24 06:55:42

问题


In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?

I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:

  1. The range 0x00-0x19 (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)

  2. The range 0x7F-0x9F (more control characters)

Ranges of characters that can safely be accepted would be even better to know.

There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.


回答1:


See the W3 Unicode in XML and other markup languages note. It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:

  • U+2028–9 which are funky newlines that will confuse JavaScript if you try to use them in a string literal;

  • U+202A–E which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;

  • language override control codes that could also have scope outside of an element;

  • BOM.

Additionally, you'd want to filter/replace the characters that are not valid in Unicode at all (U+FFFF et al), and, if you are using a language that works in UTF-16 natively (eg. Java, Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.

The range 0x00-0x19 (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)

And arguably (esp for a web application), lose CR as well, and turn tabs into spaces.

The range 0x7F-0x9F (more control characters)

Yep, away with those, except in case where people might really mean them. (SO used to allow them, which allowed people to post strings that had been mis-decoded, which was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.




回答2:


I suppose it depends on your purpose. In UTF-8, you could limit the user to the keyboard characters if that is your whim, which is 9,10,13,[32-126]. If you are using UTF-8, the 0x7f+ range signifies that you have a multi-byte Unicode character. In ASCII, 0x7f+ consists special display/format characters, and is localized to allow extensions depending on the language at the location.

Note that in UTF-8, the keyboard characters can differ depending on location, since users can input characters in their native language which will be outside the 0x00-0x7f range if their language doesn't use a Latin script without accents (Arabic, Chinese, Japanese, Greek, Crylic, etc.).

If you take a look here you can see what characters from UTF-8 will display.



来源:https://stackoverflow.com/questions/3197639/bare-minimum-text-sanitation

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!