I want to remove new lines from some HTML (with PHP), except inside <pre>
tags, where whitespace is obviously important.
Split the content up. This is easily done with...
$blocks = preg_split('/<(|\/)pre>/', $html);
Just be careful, because the $blocks elements won't contain the opening and closing pre tags. I feel that assuming the HTML is valid is acceptable, so you can expect the pre blocks to be every other element in the array (1, 3, 5, ...). This is easily tested with $i % 2 == 1.
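To make the array layout concrete, here is a minimal sketch of what the split returns. The sample string is hypothetical and chosen only for illustration. It also assumes bare <pre> tags: this pattern will not match a <pre> that carries attributes.

```php
<?php
// Hypothetical sample input, chosen only to show the array layout.
$html = "<p>\nintro\n</p><pre>line one\nline two</pre><p>\noutro\n</p>";

// Without PREG_SPLIT_DELIM_CAPTURE the matched tags are dropped from the result.
$blocks = preg_split('/<\/?pre>/', $html);

// Odd indices hold the contents of pre blocks, even indices hold everything else.
foreach ($blocks as $i => $block) {
    $kind = ($i % 2 === 1) ? 'pre ' : 'text';
    echo $i, ' ', $kind, ' ', var_export($block, true), "\n";
}
```

With this input the array has three elements, and element 1 is the pre content with its line break intact.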
Example "complete" script (modify as you need to)...
<?php
//our example HTML - could just as easily be read in from a file
$html = <<<EOF
<html>
<head>
<title>test</title>
</head>
<body>
<h1>Title</h1>
<p>
This is an article about...
</p>
<pre>
line one
line two
line three
</pre>
<div style="float: right;">
random
</div>
</body>
</html>
EOF;
//break it all apart...
$blocks = preg_split('/<(|\/)pre>/', $html);
//and put it all back together again
$html = ""; //reuse as our buffer
foreach($blocks as $i => $block)
{
    if($i % 2 == 1)
        $html .= "\n<pre>$block</pre>\n"; //break out <pre>...</pre> with \n's
    else
        $html .= str_replace(array("\n", "\r"), "", $block);
}
echo $html;
?>
It may be 3 years later, but... The following code will strip line breaks and other whitespace around tags and collapse runs of whitespace, as long as it is outside of pre tags. Cheers!
function sanitize_output($buffer)
{
    $search = array(
        '/\>[^\S ]+/s', // strip whitespace after tags, except spaces
        '/[^\S ]+\</s', // strip whitespace before tags, except spaces
        '/(\s)+/s'      // shorten multiple whitespace sequences
    );
    $replace = array(
        '>',
        '<',
        '\\1'
    );
    // Split on pre tags and keep the tags themselves in the result.
    // The limit is -1 (no limit); passing null is deprecated as of PHP 8.1.
    $blocks = preg_split('/(<\/?pre[^>]*>)/', $buffer, -1, PREG_SPLIT_DELIM_CAPTURE);
    $buffer = '';
    foreach($blocks as $i => $block)
    {
        if($i % 4 == 2)
            $buffer .= $block; // contents of a <pre> block: keep untouched
        else
            $buffer .= preg_replace($search, $replace, $block);
    }
    return $buffer;
}
ob_start("sanitize_output");
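The $i % 4 == 2 test may look arbitrary, so here is a small sketch of the array that the split produces once the delimiters are captured. The sample string is hypothetical. The class attribute is there only to show that [^>]* tolerates attributes on the opening tag.

```php
<?php
// Hypothetical sample input, chosen only to show the array layout.
$buffer = "<p>a</p>\n<pre class=\"code\">x\n  y</pre>\n<p>b</p>";

$blocks = preg_split('/(<\/?pre[^>]*>)/', $buffer, -1, PREG_SPLIT_DELIM_CAPTURE);

// Layout, repeating every four entries:
//   0: text before   1: opening tag   2: pre contents   3: closing tag
foreach ($blocks as $i => $block) {
    echo $i, ($i % 4 === 2 ? ' keep   ' : ' minify '), var_export($block, true), "\n";
}
```

Only index 2 (and 6, 10, ...) holds pre contents. The tags at indices 1 and 3 also go through the minifying branch, which should be harmless for well-formed pre tags.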
If the HTML is well formed, you can rely on the fact that <pre> tags aren't allowed to be nested. Make two passes: first you split the input into blocks of pre tags and everything else. You can use a regular expression for this task. Then you strip new lines from each non-pre block, and finally join them all back together.
Note that most html isn't well formed, so this approach may have some limits to where you can use it.
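The two passes described above could be sketched roughly as follows. This is one possible reading of the idea, not a definitive implementation. The function name strip_newlines_outside_pre is made up for illustration, and the code assumes well-formed HTML with pre elements that are never nested.

```php
<?php
/**
 * Hypothetical helper (name and signature are illustrative only).
 * Assumes well-formed HTML in which <pre> elements are never nested.
 */
function strip_newlines_outside_pre(string $html): string
{
    // Pass 1: split into alternating chunks. The whole <pre>...</pre> element
    // is captured, so odd indices hold pre blocks and even indices the rest.
    $chunks = preg_split('#(<pre\b[^>]*>.*?</pre>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($chunks === false) {
        return $html; // regex failure (e.g. backtrack limit): leave input as-is
    }

    // Pass 2: strip line breaks from the non-pre chunks only.
    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 0) {
            $chunks[$i] = str_replace(array("\r", "\n"), '', $chunk);
        }
    }

    // Join everything back together.
    return implode('', $chunks);
}

$html = "<p>\nintro\n</p>\n<pre>line one\nline two</pre>\n<p>outro</p>\n";
echo strip_newlines_outside_pre($html);
// prints <p>intro</p><pre>line one
// line two</pre><p>outro</p>
```

Capturing the whole element, rather than only the tags, keeps the opening and closing pre tags in the output without having to re-add them by hand.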