Keeping line breaks when using PHP's DomDocument appendChild

邮差的信 submitted on 2021-01-28 04:09:06

Question


I'm trying to use DOMDocument in PHP to add/parse things in an HTML document. From what I could read, setting formatOutput to true and preserveWhiteSpace to false should keep the tabs and newlines in order, but that doesn't seem to be the case for newly created or appended nodes.

Here's the code:

$dom = new \DOMDocument;
$dom->formatOutput = true;
$dom->preserveWhiteSpace = false;
$dom->loadHTMLFile($htmlsource);
$tables = $dom->getElementsByTagName('table');
foreach($tables as $table)
{
    $table->setAttribute('class', 'tborder');
    $div = $dom->createElement('div');
    $div->setAttribute('class', 'm2x');
    $table->parentNode->insertBefore($div, $table);
    $div->appendChild($table);
}
$dom->saveHTMLFile($html);

Here's what the HTML looks like:

<table>
    <tr>
        <td></td>
    </tr>
</table>

Here's what I want:

<div class="m2x">
    <table class="tborder">
        <tr>
            <td></td>
        </tr>
    </table>
</div>

Here's what I get:

<div class="m2x"><table class="tborder"><tr>
<td></td>
        </tr></table></div>

Is there something I'm doing wrong? I've tried googling this as many different ways as I could think of, with no luck.


Answer 1:


Unfortunately, you might need to write a function that indents the output how you want it. I made a little function you might find helpful.

function indentContent($content, $tab="\t")
{               

        // add marker linefeeds to aid the pretty-tokeniser (adds a linefeed between all tag-end boundaries)
        $content = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $content);

        // now indent the tags
        $token = strtok($content, "\n");
        $result = ''; // holds formatted version as it is built
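        $pad = 0; // initial indent
        $indent = 0; // indent change for the following line; initialised so the closing-tag test below never reads an undefined variable
        $matches = array(); // filled by preg_match()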

        // scan each line and adjust indent based on opening/closing tags
        while ($token !== false) 
        {
                $token = trim($token);
                // test for the various tag states

                // 1. open and closing tags on same line - no change
                if (preg_match('/.+<\/\w[^>]*>$/', $token, $matches)) $indent=0;
                // 2. closing tag - outdent now
                elseif (preg_match('/^<\/\w/', $token, $matches))
                {
                        $pad--;
                        if($indent>0) $indent=0;
                }
                // 3. opening tag - don't pad this one, only subsequent tags
                elseif (preg_match('/^<\w[^>]*[^\/]>.*$/', $token, $matches)) $indent=1;
                // 4. no indentation needed
                else $indent = 0;

                // prefix the line with one $tab per indent level (str_pad would count characters, not levels, for a multi-character $tab)
                $line = str_repeat($tab, max(0, $pad)) . $token;
                $result .= $line."\n"; // add to the cumulative result, with linefeed
                $token = strtok("\n"); // get the next token
                $pad += $indent; // update the pad size for subsequent lines    
        }       

        return $result;
}

indentContent($dom->saveHTML()) will return:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <body>
        <div class="m2x">
            <table class="tborder">
                <tr>
                    <td>
                    </td>
                </tr>
            </table>
        </div>
    </body>
</html>

I created this function starting with this one.
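
The first step of the function above, splitting the markup so that every tag sits on its own line, can be tried in isolation. This is only a small sketch: the input string is the unformatted markup from the question, and the variable names are illustrative.

```php
<?php
// Illustrative input: the single-line markup from the question.
$html = '<div class="m2x"><table class="tborder"><tr><td></td></tr></table></div>';

// Insert a linefeed at every boundary where one tag ends and the next begins.
// This is the same pattern and replacement that indentContent() uses.
$split = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $html);

// Each tag is now on its own line, ready to be indented line by line.
$lines = explode("\n", $split);
echo $split, "\n";
```

Because the pattern only matches a ">" that is immediately followed by a "<", text between tags (for example <td>text</td>) stays on the same line as its tags.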




Answer 2:


I modified the great function ghbarratt wrote so that void elements (such as <br> or <meta>, which never have a closing tag) don't increase the indentation.

function indentContent($content, $tab="\t")
{
    // add marker linefeeds to aid the pretty-tokeniser (adds a linefeed between all tag-end boundaries)
    $content = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $content);

    // now indent the tags
    $token = strtok($content, "\n");
    $result = ''; // holds formatted version as it is built
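    $pad = 0; // initial indent
    $indent = 0; // indent change for the following line; initialised so the closing-tag test below never reads an undefined variable
    $matches = array(); // filled by preg_match()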

    // scan each line and adjust indent based on opening/closing tags
    while ($token !== false) 
    {
        $token = trim($token);
        // test for the various tag states

        // 1. open and closing tags on same line - no change
        if (preg_match('/.+<\/\w[^>]*>$/', $token, $matches)) $indent=0;
        // 2. closing tag - outdent now
        elseif (preg_match('/^<\/\w/', $token, $matches))
        {
            $pad--;
            if($indent>0) $indent=0;
        }
        // 3. opening tag - don't pad this one, only subsequent tags (only if it isn't a void tag)
        elseif (preg_match('/^<\w[^>]*[^\/]>.*$/', $token, $matches))
        {
            $voidTag = false;
            foreach ($matches as $m)
            {
                // Void elements according to http://www.htmlandcsswebdesign.com/articles/voidel.php
                if (preg_match('/^<(area|base|br|col|command|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)/im', $m))
                {
                    $voidTag = true;
                    break;
                }
            }

            if (!$voidTag) $indent=1;
        }
        // 4. no indentation needed
        else $indent = 0;

        // prefix the line with one $tab per indent level (str_pad would count characters, not levels, for a multi-character $tab)
        $line = str_repeat($tab, max(0, $pad)) . $token;
        $result .= $line."\n"; // add to the cumulative result, with linefeed
        $token = strtok("\n"); // get the next token
        $pad += $indent; // update the pad size for subsequent lines    
    }    

    return $result;
}

All credits go to ghbarratt.
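
The void-element check can also be pulled out into a small helper, which makes it easier to test on its own. This is a sketch under assumptions: the name isVoidTag is hypothetical, the element list is copied from the function above, and the \b word boundary is an addition that is not in the answer (without it, <colgroup> would be treated as the void element col).

```php
<?php
// Hypothetical helper, not part of the answer above: reports whether a line
// starts with a void element, i.e. an element that never has a closing tag.
function isVoidTag(string $token): bool
{
    // Same element list as in the answer above.
    $void = 'area|base|br|col|command|embed|hr|img|input|keygen|link|meta|param|source|track|wbr';

    // \b requires the tag name to end here, so <colgroup> does not match "col".
    return preg_match('/^<(' . $void . ')\b/i', ltrim($token)) === 1;
}

var_dump(isVoidTag('<meta charset="UTF-8">'));  // bool(true)
var_dump(isVoidTag('<colgroup>'));              // bool(false)
var_dump(isVoidTag('<table class="tborder">')); // bool(false)
```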




Answer 3:


Neither @Stan's nor @ghbarratt's version copes well with a document that starts with the HTML5 <!DOCTYPE html> declaration. The closing </head> tag and everything after it end up indented one level too deep.

Expected:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
  </head>
  <body>
    <!-- all good -->
  </body>
</html>

Result:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    </head>
    <body>
      <!-- all good -->
    </body>
  </html>

A little testing revealed a partial fix: adding the <html> element to the void element list. However, that does not solve the problem with </head>, and it also flattens the children of <html> (namely <head> and <body>).

Edit #1: It turns out that <meta charset="UTF-8"> is responsible for the incorrect indentation after all.

Edit #2 - Solution

After a little troubleshooting I discovered that <meta>, being a void tag, throws off the indentation of the next closing tag. I solved this by adding a flag: when a void tag is found the flag is set, and the next closing tag is then outdented by one extra level.

function indentContent($content, $tab="\t"){
    // add marker linefeeds to aid the pretty-tokeniser (adds a linefeed between all tag-end boundaries)
    $content = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $content);

    // now indent the tags
    $token = strtok($content, "\n");
    $result = ''; // holds formatted version as it is built
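    $pad = 0; // initial indent
    $indent = 0; // indent change for the following line
    $nextTagNegative = false; // set when a void tag was seen; the next closing tag then outdents one extra level
    $matches = array(); // filled by preg_match()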

    // scan each line and adjust indent based on opening/closing tags
    while ($token !== false && strlen($token)>0)
    {
        $token = trim($token);
        // test for the various tag states

        // 1. open and closing tags on same line - no change
        if (preg_match('/.+<\/\w[^>]*>$/', $token, $matches)) $indent=0;
        // 2. closing tag - outdent now
        elseif (preg_match('/^<\/\w/', $token, $matches))
        {
            $pad--;
            if($indent>0) $indent=0;
            if ($nextTagNegative) {
                // a void tag was seen since the last closing tag: outdent one extra level
                $pad--;
                $nextTagNegative = false;
            }
        }
        // 3. opening tag - don't pad this one, only subsequent tags (only if it isn't a void tag)
        elseif (preg_match('/^<\w[^>]*[^\/]>.*$/', $token, $matches))
        {
            $voidTag = false;
            foreach ($matches as $m)
            {
                // Void elements according to http://www.htmlandcsswebdesign.com/articles/voidel.php
                if (preg_match('/^<(area|base|br|col|command|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)/im', $m))
                {
                    $voidTag = true;
                    break;
                }
            }

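            // set the flag only for void tags; setting it for every opening tag outdents later closing tags too far
            if ($voidTag) $nextTagNegative = true;
            else $indent = 1;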
        }
        // 4. no indentation needed
        else $indent = 0;

        // prefix the line with one $tab per indent level (str_pad would count characters, not levels, for a multi-character $tab)
        $line = str_repeat($tab, max(0, $pad)) . $token;
        $result .= $line."\n"; // add to the cumulative result, with linefeed
        $token = strtok("\n"); // get the next token
        $pad += $indent; // update the pad size for subsequent lines
    }

    return $result;
}
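
An alternative that needs no flag is to reset the indent change to 0 for every line and raise it to 1 only for a non-void opening tag, so a void element can never carry over the indent of the tag before it. The following is a minimal sketch of that idea, not a drop-in replacement for the functions above: the name indentHtml is hypothetical, and the \b word boundary and the lookbehind in the opening-tag pattern are additions that do not appear in the answers. Under these assumptions it should also cope with several void elements in a row.

```php
<?php
// Hypothetical variant of indentContent(): the indent change is recomputed
// for every line, so void elements never increase the indentation.
function indentHtml(string $content, string $tab = "\t"): string
{
    // Same void-element list as in the answers above.
    $void = 'area|base|br|col|command|embed|hr|img|input|keygen|link|meta|param|source|track|wbr';

    // Put every tag on its own line.
    $content = preg_replace('/></', ">\n<", $content);

    $result = '';
    $pad = 0; // current indent level
    foreach (explode("\n", $content) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }

        $indent = 0; // change applied to the lines that follow this one
        if (preg_match('/.+<\/\w[^>]*>$/', $line)) {
            // opening and closing tag on the same line: no change
        } elseif (preg_match('/^<\/\w/', $line)) {
            // closing tag: outdent this line, never below zero
            $pad = max(0, $pad - 1);
        } elseif (preg_match('/^<\w[^>]*(?<!\/)>/', $line)
            && !preg_match('/^<(' . $void . ')\b/i', $line)) {
            // non-void opening tag: indent the lines that follow
            $indent = 1;
        }

        $result .= str_repeat($tab, $pad) . $line . "\n";
        $pad += $indent;
    }

    return $result;
}

// Usage: two consecutive void elements inside <head>, indented with two spaces.
echo indentHtml('<!DOCTYPE html><html><head><meta charset="UTF-8"><link rel="x"></head><body><p>hi</p></body></html>', '  ');
```

Recomputing $indent on every line keeps the state down to a single counter, at the cost of departing from the structure of the original function.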


Source: https://stackoverflow.com/questions/7838929/keeping-line-breaks-when-using-phps-domdocument-appendchild
