What are all the HTML escaping contexts?

后端未结

关注

 5  1089

予麋鹿 2021-02-09 03:45

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in \"regular\" text (tha

5条回答

再見小時候 (楼主)

2021-02-09 03:58
```
This is regular text
```
Text content: & must be escaped. < must be escaped.

If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.

In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.

Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.

In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.

To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.

For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write , it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.

sections and s in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
CDATA-elements like