Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags

情话喂你 2021-02-12 17:45

What I am trying to do is include an HTML file within a PHP system (not a problem) but that HTML file also needs to be usable on its own, for various reasons, so I need to know

7 Answers
  • 2021-02-12 18:21

    You may want to use the PHP tidy extension, which can repair invalid XHTML structures (the kind that can make DOMDocument fail to load the markup) and can also extract only the body:

    $tidy = new tidy();
    $htmlBody = $tidy->repairString($html, array(
        'output-xhtml' => true,
        'show-body-only' => true,
    ), 'utf8');
    

    Then load extracted body into DOMDocument:

    $xml = new DOMDocument();
    $xml->loadHTML($htmlBody);
    

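    Then traverse, extract, or move the nodes around as needed, and save. Note that loadHTML() wraps a fragment in html and body elements again, and saveXML() without an argument serializes the whole document, so pass a node to it if you only want part of the tree: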

    $output = $xml->saveXML();
    
  • 2021-02-12 18:28

    A solution with only one instance of DOMDocument and without loops (note that the output still includes the body tags themselves):

    $d = new DOMDocument();
    $d->loadHTML(file_get_contents('/path/to/my.html'));
    $body = $d->getElementsByTagName('body')->item(0);
    echo $d->saveHTML($body);
    
  • Use a DOM parser. This is not tested, but it ought to do what you want:

    $domDoc = new DOMDocument();
    $domDoc->loadHTMLFile('/path/to/file');
    $body = $domDoc->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $child){
        echo $child->C14N(); // Note: this canonicalizes the node's representation, but that's not necessarily a bad thing
    }
    

    If you want to avoid canonicalization, you can use this version (thanks to @Jared Farrish)

  • 2021-02-12 18:31

    This may be a solution. I tried it and it works fine.

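    function parseHTML(string) {
        // DOMParser is a browser API; parsing as "text/html" always yields a document with a <body>.
        var parser = new DOMParser(),
            result = parser.parseFromString(string, "text/html");
        // Use result.body directly: walking firstChild/lastChild fails when a DOCTYPE node comes first.
        return result.body.innerHTML;
    }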

  • 2021-02-12 18:32
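    // Fetch the page; reading a URL this way requires allow_url_fopen to be enabled.
    $site = file_get_contents("http://www.google.com/");

    // Capture everything between <body ...> and </body> (case-insensitive, dot also matches newlines).
    if (preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches)) {
        echo $matches[1];
    }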
    
  • 2021-02-12 18:39

    Use DOMDocument to keep what you need rather than strip what you don't need (PHP >= 5.3.6)

    $d = new DOMDocument;
    $d->loadHTMLFile($fileLocation);
    $body = $d->getElementsByTagName('body')->item(0);
    // perform innerhtml on $body by enumerating child nodes 
    // and saving them individually
    foreach ($body->childNodes as $childNode) {
      echo $d->saveHTML($childNode);
    }
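
    The loop above can be wrapped in a small reusable function. This is only a sketch: the name bodyInnerHtml and the sample $page string are made up for illustration, and the libxml_use_internal_errors() call is an added assumption to keep parse warnings quiet on imperfect markup.

    ```php
    <?php
    // Hypothetical helper (not from the answers above): returns the markup inside <body>.
    function bodyInnerHtml(string $html): string
    {
        // Collect libxml parse warnings internally instead of emitting them.
        $previous = libxml_use_internal_errors(true);

        $doc = new DOMDocument();
        $doc->loadHTML($html);

        // Discard the collected warnings and restore the previous error setting.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $body = $doc->getElementsByTagName('body')->item(0);
        if ($body === null) {
            return '';
        }

        // Serialize each child of <body> so the <body> tags themselves are left out.
        $inner = '';
        foreach ($body->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        return $inner;
    }

    // Illustrative input only.
    $page = '<!DOCTYPE html><html><head><title>Demo</title></head>'
          . '<body><h1>Hello</h1><p>World</p></body></html>';
    echo bodyInnerHtml($page);
    ```

    Returning a string instead of echoing inside the loop lets the including PHP page decide where the content goes. The serializer may insert line breaks between block-level elements, so the output is not guaranteed to match the source byte for byte.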
    