Using HTML Purifier on a site with only plain text input

巧了我就是萌 提交于 2019-12-01 14:16:57

As a general rule, escaping should be done for context and for use-case.

If what you want to do is output plain text in an HTML context (and you do), then you need to use escaping functionality that will ensure that you will always output plain text in an HTML context. Given basic PHP, that would indeed be htmlspecialchars($yourString, ENT_QUOTES, 'yourEncoding');.

If what you want to do is output HTML in an HTML context (you don't), then you would want to santitise the HTML when you output it to prevent it from doing damage - here you would $purifier->purify($yourString); on output.

If you want to store plain text user input in a database (again, you do) by executing SQL statements, then you should either use prepared statements to prevent SQL injection, or an escaping function specific to your DB, such as mysql_real_escape_string($yourString).

You should not:

  • escape for HTML when you are putting data into the database
  • sanitise as HTML when you are putting data into the database
  • sanitise as HTML when you are outputting data as plain text

Of those, all are outright harmful, albeit to different degrees. Note that the following assumes the database is your only or canonical storage medium for the data (it also assumes you have SQL injection taken care of in some other way - if you don't, that'll be your primary issue):

  • if you escape for HTML when you put the data into the database, you rely on the guarantee that you will always be outputting the data into an HTML context; suddenly if you want to just put it into a plaintext file for printing as-is, you need to decode the data before you output it.
  • if you sanitise as HTML when you put the data into the database, you are destroying information that your user put there. Is it a messaging system and your user wanted to tell someone else about <script> tags? Your user can't do that - you'll destroy that part of his message!

Sanitising as HTML when you're outputting data as plain text (without also escaping it) may have confusing, page-breaking results if you don't set your sanitising module to strip all HTML (which you shouldn't, since then you clearly don't want to be outputting HTML).

Did you sanitise for a <div> context, but are putting your data into an inline element? Your user might put a <div> into your inline element, forcing a layout break into your page layout (how annoying this is depends on your layout), or to influence user perception of metadata (for example to make phishing easier), e.g. like this:

  • Name: John Doe
    (Site admin)

Did you sanitise for a <span> context? The user could use other tags to influence user perception of metadata, e.g. like this:

  • Name: John Doe (this user is an administrator)

Worst-case scenario: Did you sanitise your HTML with a version of HTML Purifier that later turns out to have a bug that does allow a certain kind of malicious HTML to survive? Now you're outputting untrusted data and putting users that view this data on your web page at risk.

Sanitising as HTML and escaping for HTML (in that order!) does not have this problem, but it means the sanitising step is unnecessary, meaning this constellation will just cost you performance. (Presumably that's why your colleague wanted to do the sanitising when saving the data, not when displaying it - presumably your use-case (like most) will display the data more often than the data will be submitted, meaning you would avoid having to deal with the performance hit frequently.)

tl;dr

Sanitising as HTML when you're outputting as plain text is not a good idea.

Escape / sanitise for use-case and context.

In your situation, you want to escape plain text for an HTML context (= use htmlspecialchars()).

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!