Cleaning all inline events from HTML tags

旧巷老猫 submitted on 2019-12-05 01:32:21

Question


For HTML input, I want to neutralize all HTML elements that have inline JavaScript (onclick="..", onmouseout=".." etc). Isn't it enough to encode the following characters: =, ( and )?

So onclick="location.href='ggg.com'"
will become onclick%3D"location.href%3D'ggg.com'"

What am I missing here?

Edit: I do need to accept active HTML (I can't just escape it all or convert it to entities).


Answer 1:


There's no simple method to accept HTML, but not scripts.

You have to parse HTML to DOM, remove all unwanted elements and attributes in DOM and generate new HTML.

It can't be done reliably with regular expressions.

on* attributes are not enough. Scripts can be embedded in style, src, href and other attributes.
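As a rough illustration of the parse, filter and re-serialize approach, here is a minimal Python sketch built on the standard library's html.parser. The tag, attribute and scheme allowlists, and the names AllowlistSanitizer, is_safe_url and sanitize, are assumptions made up for this example. It is not a substitute for a vetted sanitizer library.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists for illustration only; a real policy needs review.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br", "ul", "ol", "li"}
VOID_TAGS = {"br"}                       # allowed tags that take no closing tag
ALLOWED_ATTRS = {"a": {"href", "title"}}
URL_ATTRS = {"href"}
SAFE_SCHEMES = {"http", "https", "mailto"}
DROP_WITH_CONTENT = {"script", "style"}


def is_safe_url(value):
    """Accept relative URLs and URLs whose scheme is allowlisted."""
    # Strip whitespace and control characters before looking at the scheme.
    cleaned = re.sub(r"[\x00-\x20]+", "", value)
    match = re.match(r"([A-Za-z][A-Za-z0-9+.\-]*):", cleaned)
    return match is None or match.group(1).lower() in SAFE_SCHEMES


class AllowlistSanitizer(HTMLParser):
    """Rebuilds HTML from parser events, keeping only allowlisted markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0              # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue                 # drops on*, style and anything unlisted
            value = value or ""
            if name in URL_ATTRS and not is_safe_url(value):
                continue                 # drops javascript: and similar URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))  # re-escape text nodes


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()                       # flush any buffered text
    return "".join(parser.out)


dirty = ('<p onclick="alert(1)">Hi <a href="javascript:alert(2)" title="t">x</a>'
         '<script>alert(3)</script>'
         '<a href="https://example.com/?a=1&amp;b=2" onmouseover="x()">ok</a></p>')
clean = sanitize(dirty)
print(clean)
```

The sketch uses an allowlist rather than a blocklist: anything not explicitly permitted is dropped, so an on* attribute the author never heard of is removed along with the known ones.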

If you're using PHP, then use HTML Purifier.




Answer 2:


You probably have a couple of options... the easiest way is to convert quotes, and possibly <> characters, to their HTML encoded equivalents (&quot; etc.), which will result in the HTML code being displayed literally.

Tell me what server-side language you are using and I can point you towards more language-specific information, if you like. (For example, PHP has htmlspecialchars()[1]).
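For comparison, here is what that kind of escaping looks like in Python, using html.escape from the standard library as a rough counterpart to htmlspecialchars(). The input string is made up for the example.

```python
from html import escape

# Untrusted input containing an inline event handler (made-up example).
untrusted = '<b onclick="alert(1)">hi</b>'

# Escape &, <, > and, with quote=True, both kinds of quotation marks.
escaped = escape(untrusted, quote=True)
print(escaped)
```

The browser then shows the markup as literal text instead of rendering it, which is why this option does not fit when active HTML has to be accepted.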

EDIT: I just actually read your question. Okay, you want to allow HTML through but no JavaScript? Well, for lack of a simple solution jumping to my mind, I suggest just using string replacement (regular expressions if you can, maybe?) to get rid of them entirely.

There is a finite set of event handler attributes in JavaScript. Couple that with the need for quotation marks and you're probably good.

For proof of concept, in Perl, you'd probably do something like this:

# [...] stands for the remaining handler names, which are left out here.
$myInput =~ s/\bon(mouseover|mouseout|click|focus|blur)\s*=\s*("[^"]*"|'[^']*')\s*//gi;

So, match the event handler name (only some of which I included), then the equals sign, then a quoted expression using either single or double quotes, allow optional whitespace on the end, and replace the entire thing with nothing (i.e., delete it).

That won't work for something requiring more levels of quotation, though, since eventually you would come back to the original delimiters. Forgive the contrived and completely useless example:

onclick="eval('3+prompt("Enter a number: ")')"
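To make the failure concrete, here is a small Python sketch of the same regex idea run against a simple handler and against the nested example above. The pattern is an assumed port of the Perl one with the equals sign included, the handler list is still incomplete, and HANDLER_RE and the sample tags are made up for the demonstration.

```python
import re

# Assumed Python port of the Perl pattern; the handler list is incomplete.
HANDLER_RE = re.compile(
    r"""\bon(?:mouseover|mouseout|click|focus|blur)\s*=\s*(?:"[^"]*"|'[^']*')\s*""",
    re.IGNORECASE,
)

simple = '''<a onclick="go('x')" href="/a">link</a>'''
nested = '''<a onclick="eval('3+prompt("Enter a number: ")')" href="/a">link</a>'''

cleaned_simple = HANDLER_RE.sub("", simple)
cleaned_nested = HANDLER_RE.sub("", nested)

print(cleaned_simple)   # handler removed cleanly
print(cleaned_nested)   # match stops at the first inner double quote
```

In the nested case the match ends at the double quote before "Enter", so the rest of the handler is left behind as stray text inside the tag.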

In THAT case, you might want to write a loop that parses the string first by word (i.e., looking for the event handler name), then going character by character, keeping track of the number of quoting levels as you go and keeping track of the current delimiter:

  1. Mark the index of the beginning of the handler name (the "o" in onclick, etc.)
  2. Start with quoting level 0 (or 1 after you've processed the opening quotation delimiter).
  3. If the current delimiter is " and you see ', then increase the quoting level by 1 and switch current delimiter to '.
  4. If the current delimiter is " and you see ", decrease the quoting level by 1 and switch current delimiter to '.
  5. If the current delimiter is ' and you see ", then increase the quoting level by 1 and switch current delimiter to ".
  6. If the current delimiter is ' and you see ', decrease the quoting level by 1 and switch current delimiter to ".
  7. If the quoting level gets back down to 0, then your string has ended. Mark the index of where the string ends.
  8. Use a string manipulation function to cut out the substring from the first index to the last index.
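The steps above can be sketched in Python roughly as follows. The function name strip_handler and the OTHER lookup table are made up for this example. It handles only the first occurrence of a single handler name, assumes the alternating-quote convention described in the steps, and leaves the input untouched when the value is unquoted or the quotes never balance. It follows the steps as written rather than the way an HTML parser tokenizes attribute values.

```python
import re

OTHER = {'"': "'", "'": '"'}   # maps each quote character to the other one


def strip_handler(text, handler="onclick"):
    """Remove the first quoted `handler=...` attribute by counting quote levels."""
    found = re.search(re.escape(handler), text, re.IGNORECASE)
    if found is None:
        return text
    start = found.start()                  # step 1: index of the "o" in the name
    pos = found.end()
    # Skip optional whitespace and the equals sign before the opening quote.
    while pos < len(text) and text[pos] in " \t\r\n=":
        pos += 1
    if pos >= len(text) or text[pos] not in OTHER:
        return text                        # unquoted value: out of scope here
    current = text[pos]                    # delimiter that closes the current level
    level = 1                              # step 2: level 1 after the opening quote
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == current:                  # steps 4 and 6: close one level
            level -= 1
            current = OTHER[current]
            if level == 0:                 # step 7: the handler value has ended
                # Step 8: cut out the attribute and the whitespace after it.
                return text[:start] + text[pos + 1:].lstrip()
        elif ch == OTHER[current]:         # steps 3 and 5: open a deeper level
            level += 1
            current = ch
        pos += 1
    return text                            # quotes never balanced; change nothing


nested = '''<a onclick="eval('3+prompt("Enter a number: ")')" href="/a">link</a>'''
result = strip_handler(nested)
print(result)
```

A stray apostrophe inside the handler text would throw the counting off, which is one reason the parse-and-filter route is usually preferred.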

It's a little more time-consuming, but it should theoretically work no matter what, assuming the HTML is well-formed. (That's a horrible assumption, but if it's not well-formed you could just reject the input anyway!)

[1] http://us3.php.net/manual/en/function.htmlspecialchars.php



Source: https://stackoverflow.com/questions/1258411/cleaning-all-inline-events-from-html-tags
