I am trying to find out if there are any principles in defining which pages should be gzip-compressed and to draw a line when to send plain html content.
It would be he
We made the decision to gzip all content since spending time determining what to gzip or what not to gzip did not seem worth the effort. The overhead of gzipping everything is not significantly higher than gzipping nothing.
This webpage suggests:
"Servers choose what to gzip based on file type, but are typically too limited in what they decide to compress. Most web sites gzip their HTML documents. It's also worthwhile to gzip your scripts and stylesheets, but many web sites miss this opportunity. In fact, it's worthwhile to compress any text response including XML and JSON. Image and PDF files should not be gzipped because they are already compressed. Trying to gzip them not only wastes CPU but can potentially increase file sizes."
If you care about cpu time, I would suggest not gzipping already compressed content. Remember when adding complexity to a system that Programmers/sys admins are expensive, servers are cheap.
Unless your server's CPU is heavily utilized, I would always use compression. It's a trade-off between bandwidth and CPU utilization, and webservers usually have plenty of spare CPU cycles to spare.
I don't think there's a good reason not to gzip HTML content.
It takes very little CPU power for big gains in loading speed.
A good idea is to benchmark, how fast is the data coming down v.s. how well compressed is it. If it takes 5 seconds to send something that went from 200K to 160K it's probably not worth it. There is a cost of compression on the server side and if the server gets busy it might not be worth it.
For the most part, if your server load is below 0.8 regularly, I'd just gzip anything that isn't binary like jpegs, pngs and zip files.
There's a good write up here:
http://developer.yahoo.com/performance/rules.html#gzip
There's one notable exception: There's a bug in Internet Explorer 6, that makes all compressed content turn up blank.
Considering there is a huge gain on the size of the HTML data to download when it's gzipped, I don't see why you shouldn't gzip it.
Maybe it uses a little bit of CPU... But not that much ; and it's really interesting for the client, who has less to download. And it's only a couple of lines in the webserver configuration to activate it.
(But let your webserver do that : there are modules like mod_deflate for the most used servers)
As a semi-sidenote : you are talking about compressing HTML content pages... But stop at HTML pages : you can compress JS and CSS too (they are text files, and, so, are generally compressed really well), and it doesn't cost much CPU either.
Considering the big JS/CSS Frameworks in use nowadays, the gain is probably even more consequent by compressing those than by compressing HTML pages.