How to use the PHP DOM parser

轻奢々 2020-11-28 13:04

I'm new to DOM parsing in PHP:
I have an HTML file that I'm trying to parse. It has a bunch of DIVs like this:

<
4 answers
  • 2020-11-28 13:32

    I got this to work using simplehtmldom as a start:

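    // Assumes simple_html_dom.php has been included; file_get_html() needs a full URL or a file path
    $html = file_get_html('http://example.com/');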
    foreach ($html->find('div[id=interestingbox]') as $result)
    {
        echo $result->innertext;
    }
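
    For comparison, the same lookup can be sketched with PHP's built-in DOM extension, so no third-party library is required. The sample markup and the helper name innerHtml are assumptions made for illustration, not part of simplehtmldom.

    ```php
    <?php
    // Hypothetical sample markup standing in for the page being parsed.
    $html = '<html><body><div id="interestingbox"><p>Hello <b>world</b></p></div></body></html>';

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Hypothetical helper: serialize the children of a node, roughly what innertext returns.
    function innerHtml(DOMNode $node): string
    {
        $out = '';
        foreach ($node->childNodes as $child) {
            $out .= $node->ownerDocument->saveHTML($child);
        }
        return $out;
    }

    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query("//div[@id='interestingbox']") as $div) {
        $results[] = innerHtml($div);
    }

    echo implode("\n", $results), "\n";
    ```

    Passing a node to saveHTML() serializes only that node, which is why the helper loops over the children instead of serializing the div itself.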
    
  • 2020-11-28 13:36

    First, note that you can't use the same id on two different divs; that is what classes are for. Every id must be unique within the document.
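
    To check whether a document breaks that rule, a short sketch like the following can list the ids that occur more than once. The sample markup is hypothetical.

    ```php
    <?php
    // Hypothetical markup in which the id "interestingbox" is (incorrectly) reused.
    $html = '<html><body>
    <div id="interestingbox">one</div>
    <div id="interestingbox">two</div>
    <div id="other">three</div>
    </body></html>';

    // Collect libxml warnings instead of printing them; the parser may
    // complain about the redefined id.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Count how often each id value occurs.
    $counts = [];
    foreach ($xpath->query('//*[@id]') as $element) {
        $id = $element->getAttribute('id');
        $counts[$id] = ($counts[$id] ?? 0) + 1;
    }

    // Keep only the ids that are used more than once.
    $duplicates = array_keys(array_filter($counts, function ($n) {
        return $n > 1;
    }));

    print_r($duplicates);
    ```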

    Code to get the contents of the div with id="interestingbox":

    $html = '
    <html>
    <head></head>
    <body>
    <div id="interestingbox"> 
       <div id="interestingdetails" class="txtnormal">
            <div>Content1</div>
            <div>Content2</div>
       </div>
    </div>
    
    <div id="interestingbox2"><a href="#">a link</a></div>
    </body>
    </html>';
    
    
    $dom_document = new DOMDocument();
    
    $dom_document->loadHTML($html);
    
    // Use DOMXPath to navigate the parsed HTML
    $dom_xpath = new DOMXPath($dom_document);
    
    // Select the div with id="interestingbox" anywhere in the document
    $elements = $dom_xpath->query("//div[@id='interestingbox']");
    
    // query() returns false (not null) when the expression is invalid
    if ($elements !== false) {
    
      foreach ($elements as $element) {
        echo "\n[". $element->nodeName. "]";
    
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
          echo $node->nodeValue. "\n";
        }
    
      }
    }
    
    // OUTPUT (approximate; the braces only mark the extent of the div and are not printed)
    [div]  {
            Content1
            Content2
    }
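
    The nested loop above also visits whitespace-only text nodes. If only the text of the innermost divs is wanted, a relative XPath query is one option. This is a minimal sketch that reuses the sample markup; the variable names are chosen for illustration.

    ```php
    <?php
    $html = '<html><body>
    <div id="interestingbox">
       <div id="interestingdetails" class="txtnormal">
            <div>Content1</div>
            <div>Content2</div>
       </div>
    </div>
    </body></html>';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $contents = [];
    foreach ($xpath->query("//div[@id='interestingbox']") as $box) {
        // Relative query: descendant divs of $box that have no div child of their own.
        foreach ($xpath->query('.//div[not(div)]', $box) as $leaf) {
            $contents[] = trim($leaf->nodeValue);
        }
    }

    print_r($contents);
    ```

    The second argument to query() sets the context node, so the leading dot keeps the search inside the matched box.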
    

    Example with classes:

    $html = '
    <html>
    <head></head>
    <body>
    <div class="interestingbox"> 
       <div id="interestingdetails" class="txtnormal">
            <div>Content1</div>
            <div>Content2</div>
       </div>
    </div>
    
    <div class="interestingbox"><a href="#">a link</a></div>
    </body>
    </html>';
    
    // Same as before; only the XPath expression changes
    
    [...]
    
    $elements = $dom_xpath->query("//div[@class='interestingbox']");
    
    [...]
    
    //OUTPUT
    [div]  {
            Content1
            Content2
    }
    
    [div]  {
    a link
    }
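
    One caveat: @class='interestingbox' compares the whole attribute value, so a div with class="interestingbox wide" would be missed. A common XPath 1.0 workaround is to match the class as a space-delimited token. The sketch below uses hypothetical markup to show the difference.

    ```php
    <?php
    // Hypothetical markup: the first div carries a second class, "wide".
    $html = '<html><body>
    <div class="interestingbox wide">first</div>
    <div class="interestingbox">second</div>
    <div class="notinterestingbox">third</div>
    </body></html>';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Exact comparison: matches only class="interestingbox".
    $exact = $xpath->query("//div[@class='interestingbox']");

    // Token comparison: pad the normalized class list with spaces,
    // then look for " interestingbox " inside it.
    $token = $xpath->query(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' interestingbox ')]"
    );

    $matched = [];
    foreach ($token as $div) {
        $matched[] = $div->nodeValue;
    }

    echo $exact->length, " exact match(es), ", $token->length, " token match(es)\n";
    ```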
    

    Refer to the DOMXPath page for more details.

  • 2020-11-28 13:40

    Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue

    // Return the inner markup of a node (a DOM stand-in for innerHTML).
    function innerXML($node)
    {
        $doc  = $node->ownerDocument;
        $frag = $doc->createDocumentFragment();
        foreach ($node->childNodes as $child) {
            $frag->appendChild($child->cloneNode(true));
        }
        return $doc->saveXML($frag);
    }

    $dom = new DOMDocument();
    // loadXML() requires well-formed markup, so body and html must be closed properly
    $dom->loadXML('
    <html>
    <body>
    <table>
    <tr>
        <td id="foo">
            The first bit of Data I want
            <br />The second bit of Data I want
            <br />The third bit of Data I want
        </td>
    </tr>
    </table>
    </body>
    </html>
    ');

    $xpath = new DOMXPath($dom);
    $node = $xpath->evaluate("/html/body//td[@id='foo']");

    $dataString = innerXML($node->item(0));
    // saveXML() serializes the empty element as "<br/>" (no space), so split on that
    $dataArr = explode("<br/>", $dataString);

    $dataUno = trim($dataArr[0]);
    $dataDos = trim($dataArr[1]);
    $dataTres = trim($dataArr[2]);

    echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
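
    Splitting the serialized string depends on exactly how the br elements are written out. An alternative that avoids this is to walk the cell's child nodes and keep only the text nodes. This is a minimal sketch under the same sample markup, not the approach from the linked thread.

    ```php
    <?php
    $dom = new DOMDocument();
    $dom->loadXML('<html><body><table><tr>
    <td id="foo">
        The first bit of Data I want
        <br />The second bit of Data I want
        <br />The third bit of Data I want
    </td>
    </tr></table></body></html>');

    $xpath = new DOMXPath($dom);
    $cell = $xpath->query("//td[@id='foo']")->item(0);

    // Collect the non-empty text nodes that sit directly inside the cell.
    $data = [];
    foreach ($cell->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $data[] = $text;
            }
        }
    }

    print_r($data);
    ```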
    
  • 2020-11-28 13:42

    WebExtractor: https://github.com/knyga/webextractor It can parse a page with CSS, regex, and XPath selectors.

    See the package and its tests for examples:

    use WebExtractor\DataExtractor\DataExtractorFactory;
    use WebExtractor\DataExtractor\DataExtractorTypes;
    use WebExtractor\Client\Client;

    $factory = DataExtractorFactory::getFactory();
    $extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
    $client = new Client;
    $content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
    $extractor->setContent($content);
    $h1 = $extractor->setSelector('h1')->extract();
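
    For a page that is already in memory, the same h1 lookup can be approximated with the built-in DOM extension. This is a rough sketch, not WebExtractor's API; the $content string is a hypothetical stand-in for the downloaded page.

    ```php
    <?php
    // Hypothetical page content; in practice this would be the downloaded HTML.
    $content = '<html><head><title>t</title></head>'
             . '<body><h1>2014 Winter Olympics</h1><p>text</p></body></html>';

    $doc = new DOMDocument();
    $doc->loadHTML($content);

    // Collect the text of every h1 element in document order.
    $headings = [];
    foreach ($doc->getElementsByTagName('h1') as $h1) {
        $headings[] = trim($h1->textContent);
    }

    print_r($headings);
    ```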
