Extracting site data with a web crawler fails due to an array index mismatch

再見小時候 2021-01-11 13:10

I have been trying to extract the text of a table, along with its links, from a page on site1.com into my PHP page using a web crawler.

But unfortunately,

4 answers
  • 2021-01-11 13:28

    Chopping at HTML with string functions or regex is not a reliable method. DOMDocument and XPath do a nice job.
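
    To illustrate why string chopping is fragile, here is a minimal sketch using hypothetical markup (the two snippets are made up for this example and differ only in attribute order). When the delimiter is not found, explode() returns a one-element array, so reading index 1 is what produces the undefined index warning:

    ```php
    <?php
    // Hypothetical markup: the second snippet differs only in attribute order.
    $expected = '<td class="FootNotes2" height="25"><a href="/a.php">A</a></td>';
    $changed  = '<td height="25" class="FootNotes2"><a href="/a.php">A</a></td>';

    $delimiter = '<td class="FootNotes2"';

    $partsExpected = explode($delimiter, $expected);
    $partsChanged  = explode($delimiter, $changed);

    echo count($partsExpected), "\n"; // 2: the text before and after the delimiter
    echo count($partsChanged), "\n";  // 1: delimiter not found, whole string returned

    // Reading $partsChanged[1] directly would raise an undefined index warning,
    // so check that the element exists before using it.
    echo isset($partsChanged[1]) ? "found\n" : "delimiter missing\n";
    ```

    A DOM parser matches on the element and its attributes rather than on the literal text, so it is not affected by this kind of change.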

    Code:

    $dom = new DOMDocument;
    libxml_use_internal_errors(true);  // keep malformed real-world HTML from emitting warnings
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $result = [];  // initialise so the array exists even when nothing matches
    foreach ($xpath->evaluate("//td[@class = 'FootNotes2']/a") as $node) {  // target a tags that have <td class="FootNotes2"> as parent
        $result[] = ['href' => $node->getAttribute('href'), 'text' => $node->nodeValue];  // extract/store the href and text values
        if (count($result) == 10) { break; }  // set a limit of 10 rows of data
    }
    if ($result) {
        echo "<ul>\n";
        foreach ($result as $data) {
            echo "\t<li class=\"itemtitle\"><a href=\"{$data['href']}\" target=\"_blank\">{$data['text']}</a></li>\n";
        }
        echo "</ul>";
    }
    

    Sample Input:

    $html = <<<HTML
    <table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
    <tbody><tr>
        <td width="1%" valign="top" class="Title2">&nbsp;</td>
        <td width="65%" valign="top" class="Title2">Subject</td>
        <td width="1%" valign="top" class="Title2">&nbsp;</td>
        <td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
        <td width="1%" valign="top" class="Title2">&nbsp;</td>
        <td width="8%" valign="top" align="Center" class="Title2">Replies</td>
        <td width="1%" valign="top" class="Title2">&nbsp;</td>
        <td width="9%" valign="top" align="Center" class="Title2">Views</td>
    </tr>
    <tr>
        <td width="1%" height="25">&nbsp;</td>
        <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
        <td width="1%" height="25">&nbsp;</td>
        <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
        <td width="1%" height="25">&nbsp;</td>
        <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
        <td width="1%" height="25">&nbsp;</td>
        <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
    </tr>
    <tr>
        <td width="1%" height="25">&nbsp;</td>
        <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837999.php" target="_top" class="Links2">some text</a> - step12013</td>
        <td width="1%" height="25">&nbsp;</td>
        <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
        <td width="1%" height="25">&nbsp;</td>
        <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
        <td width="1%" height="25">&nbsp;</td>
        <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
    </tr>
    </tbody>
    </table>
    HTML;
    

    Output:

    <ul>
        <li class="itemtitle"><a href="/files/forum/2017/1/837110.php" target="_blank">Serious dedicated study partner for U World</a></li>
        <li class="itemtitle"><a href="/files/forum/2017/1/837999.php" target="_blank">some text</a></li>
    </ul>
    
  • 2021-01-11 13:38

    Using the Simple HTML DOM Parser library, you can use the following code:

    <?php
        require('simple_html_dom.php'); // you might need to change this, depending on where you saved the library file.
    
        $html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
    
        foreach($html->find('td.FootNotes2 a') as $element) { // find all <a>-elements inside a <td class="FootNotes2">-element
            $element->href = "http://www.usmleforum.com" . $element->href;  // make the relative link absolute; other attributes can be read or set the same way.
            echo $element.'<br>';  // do something with the elements.
        }
    ?>
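
    Prepending the host as above assumes every href on the page is root-relative. As a hedged sketch, a small helper could add the prefix only when it is needed. The function name `absolutize_href` is hypothetical and not part of Simple HTML DOM, and it treats every non-absolute href as relative to the site root, which is an assumption that happens to fit the links in this table:

    ```php
    <?php
    // Hypothetical helper (not part of Simple HTML DOM): prefix the base URL
    // only when the href is not already absolute.
    function absolutize_href(string $href, string $base): string
    {
        // Leave hrefs alone when they start with a scheme (http:, https:, mailto:, ...)
        // or are protocol-relative (//host/path).
        if (preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $href)) {
            return $href;
        }
        // Assumption: anything else is resolved against the site root.
        return rtrim($base, '/') . '/' . ltrim($href, '/');
    }

    echo absolutize_href('/files/forum/2017/1/837110.php', 'http://www.usmleforum.com'), "\n";
    echo absolutize_href('http://example.com/page.php', 'http://www.usmleforum.com'), "\n";
    ```

    Inside the loop this would be used as `$element->href = absolutize_href($element->href, 'http://www.usmleforum.com');`.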
    
  • 2021-01-11 13:40

    Instead of writing your own parser solution you could use an existing one like Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html

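    use Symfony\Component\DomCrawler\Crawler;  // requires the symfony/dom-crawler package

    $crawler = new Crawler($returned_content);
    $linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
        return $node->text();  // collect the text of every <a> element
    });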
    

    Or, if you want to traverse the DOM tree yourself, you can use DOMDocument's loadHTML: http://php.net/manual/en/domdocument.loadhtml.php

    $document = new DOMDocument();
    $document->loadHTML($returned_content);
    $texts = [];
    foreach ($document->getElementsByTagName('a') as $link) {
        $texts[] = $link->nodeValue;  // collect each link text instead of overwriting one variable
    }
    

    EDIT:

    The following code extracts the links you want. It assumes you have a $returned_content variable holding the HTML you want to parse.

    // creating a new instance of DOMDocument (DOM = Document Object Model)
    $domDocument = new DOMDocument();
    // save previous libxml error reporting and set error reporting to internal
    // to be able to parse not well formed HTML doc
    $previousErrorReporting = libxml_use_internal_errors(true);
    $domDocument->loadHTML($returned_content);
    libxml_use_internal_errors($previousErrorReporting);
    $links = [];
    // getting all <a> elements from the HTML
    /** @var DOMElement $node */
    foreach ($domDocument->getElementsByTagName('a') as $node) {
        $parentNode = $node->parentNode;
        // checking if the <a> is under a <td> that has class="FootNotes2"
        $isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
        // checking if the <a> has class="Links2"
        $isLinkOfLink2Class = $node->getAttribute('class') === 'Links2';
        // I assumed you wanted only the links from those <td> cells, so both of the above conditions must hold
        if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
            $links[] = [
                'href' => $node->getAttribute('href'),
                'text' => $parentNode->textContent,
            ];
        }
    }
    
    print_r($links);
    

    This will produce an array similar to:

    Array
    (
        [0] => Array
        (
            [href] => /files/forum/2017/1/837242.php
            [text] => Q@Q Drill Time ① - cardio69
        ) 
        [1] => Array
        (
            [href] => /files/forum/2017/1/837356.php
            [text] => study partner in Houston - lacy
        )
        [2] => Array
        (
            [href] => /files/forum/2017/1/837110.php
            [text] => Serious dedicated study partner for U World - step12013
        )
        ...
    
  • 2021-01-11 13:47

    I tried the same code on another site, and it works. Please take a look:

    <?php
        function get_data($url) {
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
          curl_setopt($ch, CURLOPT_URL,$url);
          $result=curl_exec($ch);
          curl_close($ch);
          return $result;
        }
        $returned_content = get_data('http://www.usmle-forums.com/usmle-step-1-forum/');
        $first_step = explode( '<tbody id="threadbits_forum_26">' , $returned_content );
        $second_step = explode('</tbody>', $first_step[1]);
        $third_step = explode('<tr>', $second_step[0]);
        // print_r($third_step);
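        foreach ($third_step as $element) {
          $child_first = explode( '<td class="alt1"' , $element );
          if (!isset($child_first[1])) { continue; }  // this chunk has no matching cell, skip it
          $child_second = explode( '</td>' , $child_first[1] );
          $child_third = explode( '<a href=' , $child_second[0] );
          if (!isset($child_third[1])) { continue; }  // the cell contains no link, skip it
          $child_fourth = explode( '</a>' , $child_third[1] );
          echo "<a href=".$child_fourth[0]."</a><br>";
        }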
        ?>
    

    I know it's a lot to ask, but could you please combine these two pieces of code into one that makes the crawler work?

    @jkmak
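
    One way the cURL fetch and the DOM-based extraction could be combined is sketched below. This is not a tested crawler for either site: the function names `fetch_html` and `extract_links` are hypothetical, the XPath expression is the one from the first answer, and the inline sample row is copied from that answer's sample input so the parsing part can be run without network access.

    ```php
    <?php
    // Fetch a page with cURL; returns the body, or null when the request fails.
    function fetch_html(string $url): ?string
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $body = curl_exec($ch);
        return is_string($body) ? $body : null;
    }

    // Collect href/text pairs for <a> elements directly inside <td class="FootNotes2">.
    function extract_links(string $html, int $limit = 10): array
    {
        if ($html === '') {
            return [];  // nothing to parse
        }
        $dom = new DOMDocument();
        $previous = libxml_use_internal_errors(true);  // tolerate malformed HTML
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($dom);
        $links = [];
        foreach ($xpath->query("//td[@class='FootNotes2']/a") as $a) {
            $links[] = [
                'href' => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
            if (count($links) >= $limit) {
                break;  // stop once the row limit is reached
            }
        }
        return $links;
    }

    // Offline check with one row taken from the sample input above.
    $sample = '<table><tr><td class="FootNotes2"><a href="/files/forum/2017/1/837110.php">'
            . 'Serious dedicated study partner for U World</a> - step12013</td></tr></table>';
    print_r(extract_links($sample));

    // Against the live page this would become (not run here):
    // print_r(extract_links(fetch_html('http://www.usmleforum.com/forum/index.php?forum=1') ?? ''));
    ```

    Because the extraction matches elements instead of splitting strings, a page without matching cells simply yields an empty array rather than an undefined index warning.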
