By simply converting the following (\"the big 5\"):
& -> &
< -> <
> -> >
\" -> "
\' -> '
Counter measures depend on the context where the data is inserted in. If you insert the data into HTML, replacing the HTML meta character with escape sequences (i.e. character references) prevents inserting HTML code.
But if your in another context (e.g. HTML attribute value that is interpreted as URL) you have additional meta characters with different escape sequences you have to deal with.
OWASP has a great cheat sheet.
https://github.com/OWASP/CheatSheetSeries/blob/master/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.md
Will you prevent XSS attacks?
If you do this escaping at the right time(*) then yes, you will prevent HTML-injection. This is the most common form of XSS attack. It is not just a matter of security, you need to do the escapes anyway so that strings with those characters in will display correctly anyway. The issue of security is a subset of the issue of correctness.
I think you need to white list at a character level too, to prevent certain attacks
No. HTML-escaping will render every one of those attacks as inactive plain text on the page, which is what you want. The range of attacks on that page is demonstrating different ways to do HTML-injection, which can get around the stupider “XSS filters” that some servers deploy to try to prevent common HTML-injection attacks. This demonstrates that “XSS filters” are inherently leaky and ineffective.
There are other forms of XSS attack that might or might not affect you, for example bad schemes on user-submitted URIs (javascript:
et al), injection of code into data echoed into a JavaScript block (where you need JSON-style escaping) or into stylesheets or HTTP response headers (again, you always need the appropriate form of encoding when you drop text into another context; you should always be suspicious if you see anything with unescaped interpolation like PHP's "string $var string"
).
Then there's file upload handling, Flash origin policy, UTF-8 overlong sequences in legacy browsers, and application-level content generation issues; all of these can potentially lead to cross-site scripting. But HTML injection is the main one that every web application will face, and most PHP applications get wrong today.
(*: which is when inserting text content into HTML, and at no other time. Do not HTML-escape form submission data in $_POST
/$_GET
at the start of your script; this is a common wrong-headed mistake.)