Question
When using PHP Simple HTML DOM Parser, is it normal that line-break tags are stripped out?
Answer 1:
I know this is old, but I was looking for this as well and realized there is actually a built-in option to turn off the removal of line breaks. No need to edit the source.
The PHP Simple HTML DOM Parser's load function supports multiple useful parameters:
load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
When calling the load function, simply pass false as the third parameter.
$html = new simple_html_dom();
$html->load("<html><head></head><body>stuff</body></html>", true, false);
If using file_get_html, it's the ninth parameter.
file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
Edit: For str_get_html, it's the fifth parameter (thanks, yitwail).
str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
Answer 2:
Was struggling with this as well, since I needed the HTML to be easily editable after processing.
Apparently there's a boolean in the SimpleHTMLDOM script, $stripRN, that's set to true by default. It strips the \r, \n or \r\n characters from the HTML.
Set the var to false (several occurrences in the script) and your problem is solved.
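To make the effect of the flag concrete, here is a minimal sketch of what the stripping step amounts to. It assumes the parser replaces each \r and \n with a space before parsing, which is how I remember the 1.5 source; strip_rn_sketch is a hypothetical stand-in, not a function from the library.

```php
<?php
// Hypothetical stand-in for the parser's stripping step (assumption: each
// \r and \n becomes a single space when $stripRN is true).
function strip_rn_sketch(string $html, bool $stripRN = true): string
{
    if (!$stripRN) {
        return $html; // line breaks survive untouched
    }
    return str_replace(array("\r", "\n"), ' ', $html);
}

$html = "<p>first line\nsecond line</p>";

var_dump(strip_rn_sketch($html));        // line break replaced by a space
var_dump(strip_rn_sketch($html, false)); // line break preserved
```

With the flag off, whatever line breaks were in the input are still there when you read the markup back out.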
Answer 3:
You don't have to change every $stripRN to false; the only one that affects this behavior is at line 816:
// load html from string
function load($str, $lowercase=true, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT) {
Also consider changing line 988, because the multibyte (mbstring) functions are often not installed on machines that do not deal with non-Western-European languages. The original line in v1.5 breaks the script immediately when they are missing, so guard the call:
if (function_exists('mb_detect_encoding')) { $charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) ); } else { $charset = false; }
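The same guard can be tried outside the parser. This is only a sketch: detect_charset_or_false is a hypothetical helper name, and the UTF-8 fallback at the end is my assumption, not something the library does.

```php
<?php
// Hypothetical helper: returns the detected charset name, or false when
// detection fails or the mbstring extension is not installed.
function detect_charset_or_false(string $text)
{
    if (!function_exists('mb_detect_encoding')) {
        return false; // no mbstring: let the caller pick a default
    }
    // Same candidate list as the parser line quoted above.
    return mb_detect_encoding($text, array('UTF-8', 'CP1252'));
}

$charset = detect_charset_or_false('plain text');
if ($charset === false) {
    $charset = 'UTF-8'; // assumed fallback; choose what suits your data
}
echo $charset, "\n";
```

The point is that the script keeps running on a machine without mbstring instead of dying on an undefined function.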
Answer 4:
If you were passing by here wondering whether you can do the same thing in DOMDocument, then I'm pleased to say you can! - but it's a bit dirty :(
I had a snippet of code I wanted to tidy but retain the exact line breaks it contained (\n). This is what I did....
// NOTE: If your HTML isn't a full HTML document then expect DOMDocument to
// start creating its own DOCTYPE, head and body tags.
// Convert \n into a pretend tag
$myContent = preg_replace("/[\n]/","<img src=\"slashN\" />",$myContent);
// Do your DOM stuff...
$dom = new DOMDocument;
$dom->loadHTML($myContent);
$dom->formatOutput = true;
$myContent = $dom->saveHTML();
// Remove the \n's that DOMDocument put in itself
$myContent = preg_replace("/[\n]/","",$myContent);
// Put my own \n's back (saveHTML() may write the tag without the trailing slash)
$myContent = preg_replace("/<img src=\"slashN\"\s*\/?>/i","\n",$myContent);
It's important to note that I knew, without a shadow of a doubt, that my input contained only \n. You may want your own variations if \r\n or \t needs to be accounted for, e.g. slashT or slashRN.
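If you do need those variations, something along these lines might work. It's a rough sketch: the marker names (slashRN, slashN, slashT) and both helper functions are made up for illustration, and it assumes none of the markers occurs in your real content.

```php
<?php
// Hypothetical marker names; pick values that cannot occur in your input.
const WS_MARKERS = array(
    "\r\n" => 'slashRN',
    "\n"   => 'slashN',
    "\t"   => 'slashT',
);

// Replace each whitespace sequence with a placeholder <img> tag.
function protect_whitespace(string $html): string
{
    $map = array();
    foreach (WS_MARKERS as $ws => $name) {
        $map[$ws] = '<img src="' . $name . '">';
    }
    // strtr() with an array tries the longest key first, so "\r\n" wins over "\n".
    return strtr($html, $map);
}

// Turn the placeholder tags back into the original whitespace.
function restore_whitespace(string $html): string
{
    foreach (WS_MARKERS as $ws => $name) {
        // Accept <img src="x"> as well as <img src="x" />, since serializers differ.
        $html = preg_replace('/<img src="' . $name . '"\s*\/?>/', $ws, $html);
    }
    return $html;
}

$original  = "line one\r\n\tindented line\nlast line";
$protected = protect_whitespace($original);
// ... run $protected through DOMDocument here and strip the \n's it adds ...
var_dump(restore_whitespace($protected) === $original);
```

Call protect_whitespace() before loadHTML() and restore_whitespace() after you have removed the line breaks DOMDocument inserted, the same order as in the snippet above.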
Answer 5:
Another option, should one wish to preserve other formatting such as paragraphs and headings, is to use innertext rather than plaintext, then perform your own string cleaning on the result.
I realise there is a performance hit but it does allow for more granular control.
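As a rough idea of what that string cleaning could look like, here is a sketch that works on a string you have already read from innertext. clean_innertext is a hypothetical name, and which tags become line breaks is purely my choice; adjust to taste.

```php
<?php
// Hypothetical cleaner for a string obtained from $element->innertext.
function clean_innertext(string $html): string
{
    // Turn <br> variants into single line breaks.
    $text = preg_replace('/<br\s*\/?>/i', "\n", $html);
    // End paragraphs and headings with a blank line.
    $text = preg_replace('/<\/(p|h[1-6])>/i', "\n\n", $text);
    // Drop the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    // Collapse runs of three or more newlines into one blank line, then trim.
    return trim(preg_replace('/\n{3,}/', "\n\n", $text));
}

echo clean_innertext('<h1>Title</h1><p>one<br>two</p><p>a &amp; b</p>'), "\n";
```

Headings and paragraphs end up separated by blank lines and br tags become single line breaks, which plaintext would not give you.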
Source: https://stackoverflow.com/questions/4812691/preserve-line-breaks-simple-html-dom-parser