How to use Goutte

难免孤独 2021-02-08 05:55

Issue:
I cannot fully understand how to use the Goutte web scraper.

Request:
Can someone please help me understand it, or provide example code that shows how to use it?

2 answers
  • 2021-02-08 06:04

    The documentation you want to look at is the Symfony2 DomCrawler.

    Goutte is a client built on top of Guzzle that returns a Crawler every time you request a page or submit a form:

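    use Goutte\Client;
    use Symfony\Component\DomCrawler\Crawler; // needed for the Crawler type hint in the closures below

    $client = new Client();
    // request() returns a Crawler for the fetched page
    $crawler = $client->request('GET', 'http://www.symfony-project.org/');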
    

    With this crawler you can, for example, get the text of every P tag that is a direct child of the body:

    $nodeValues = $crawler->filter('body > p')->each(function (Crawler $node, $i) {
        return $node->text();
    });
    print_r($nodeValues);
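
    The selector can also be tried without any network access by building a Crawler from an HTML string. This is only a minimal sketch: it assumes Goutte was installed with Composer (which also pulls in symfony/dom-crawler and symfony/css-selector) and that vendor/autoload.php sits next to the script. The sample markup is made up for illustration.

    ```php
    <?php
    require __DIR__ . '/vendor/autoload.php'; // assumed location of the Composer autoloader

    use Symfony\Component\DomCrawler\Crawler;

    // Made-up sample page: two <p> elements are direct children of <body>, one is nested in a <div>
    $html = '<html><body><p>First</p><div><p>Nested</p></div><p>Second</p></body></html>';
    $crawler = new Crawler($html);

    // "body > p" matches only direct children of <body>, so the nested paragraph is skipped
    $nodeValues = $crawler->filter('body > p')->each(function (Crawler $node, $i) {
        return $node->text();
    });

    print_r($nodeValues); // an array holding "First" and "Second"
    ```

    If nested paragraphs should be included as well, the descendant selector 'body p' can be used instead of 'body > p'.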
    

    Fill and submit forms:

    $form = $crawler->selectButton('sign in')->form();
    $crawler = $client->submit($form, array(
        'username' => 'username',
        'password' => 'xxxxxx',
    ));
    

    The Crawler's selectButton() method returns another Crawler that matches a button (input[type=submit], input[type=image], or a button element) with the given text. [1]

    You can also click on links, set options, tick check-boxes, and more; see Form and Link support.
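
    As a rough illustration of the link and form support, the sketch below inspects a link and fills in a form on a made-up HTML page, without sending any request. It assumes a Composer install with vendor/autoload.php next to the script; the markup, the field names, and the example.com URL are hypothetical.

    ```php
    <?php
    require __DIR__ . '/vendor/autoload.php'; // assumed location of the Composer autoloader

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical page with one link and one login form
    $html = '<html><body>
        <a href="/about">About us</a>
        <form action="/login" method="post">
            <input type="text" name="username">
            <input type="password" name="password">
            <input type="checkbox" name="remember" value="1">
            <input type="submit" value="sign in">
        </form>
    </body></html>';

    // The second argument is the page URI, used to resolve relative links and form actions
    $crawler = new Crawler($html, 'http://www.example.com/');

    // Select a link by its text and read the absolute URI it points to
    $link = $crawler->selectLink('About us')->link();
    $linkUri = $link->getUri();

    // Select the form through its submit button, then set fields and tick the check-box
    $form = $crawler->selectButton('sign in')->form();
    $form['username'] = 'username';
    $form['password'] = 'xxxxxx';
    $form['remember']->tick();

    $formUri = $form->getUri();       // where the form would be submitted
    $formMethod = $form->getMethod(); // HTTP method, upper-cased
    $values = $form->getValues();     // field name => value pairs that would be sent

    echo $linkUri, "\n", $formMethod, ' ', $formUri, "\n";
    print_r($values);
    ```

    With a live Goutte client, the same objects would be passed on with $client->click($link) and $client->submit($form), each of which returns a new Crawler for the resulting page.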

    To get data out of the crawler, use the html() or text() methods:

    echo $crawler->html();
    echo $crawler->text();
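
    Both methods work on the first node in the crawler, so they are usually called after narrowing the selection with filter(). The following is a small sketch on made-up markup, again assuming a Composer install with vendor/autoload.php next to the script; attr() is shown as well because attributes are often what you need from a link.

    ```php
    <?php
    require __DIR__ . '/vendor/autoload.php'; // assumed location of the Composer autoloader

    use Symfony\Component\DomCrawler\Crawler;

    // Made-up sample markup
    $html = '<html><body><h1>Hello <em>world</em></h1><a href="/docs">Docs</a></body></html>';
    $crawler = new Crawler($html);

    $title = $crawler->filter('h1');
    $text = $title->text();   // text content of the first matched node, tags stripped
    $markup = $title->html(); // inner HTML of the first matched node, tags kept

    // attr() reads an attribute from the first matched node
    $href = $crawler->filter('a')->attr('href');

    echo $text, "\n", $markup, "\n", $href, "\n";
    ```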
    
  • 2021-02-08 06:04

    After much trial and error, I found a scraper that is easier to use, well documented, better supported (if you need help), and more effective than Goutte. If you are having issues with Goutte, try the following:

    1. Simple HTML Dom: http://simplehtmldom.sourceforge.net/

    If you are in the same situation as I was, where the page you are trying to scrape requires a referrer from its own website, you can combine cURL with Simple HTML DOM, because Simple HTML DOM does not appear to be able to send a referrer. If you do not need a referrer, Simple HTML DOM alone is enough to scrape the page.

    include 'simple_html_dom.php'; // Simple HTML DOM library file; adjust the path to where you saved it

    $url = "http://www.example.com/sub-page-needs-referer/";
    $referer = "http://www.example.com/";
    $html = new simple_html_dom(); // Create a new object for SIMPLE HTML DOM

    /** cURL Initialization  **/
    $ch = curl_init($url);

    /** Set the cURL options **/
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, false);        // Keep the response headers out of the output
    curl_setopt($ch, CURLOPT_REFERER, $referer);    // Send the referrer the site expects
    $output = curl_exec($ch);
    
    if($output === FALSE) {
      echo "cURL Error: ".curl_error($ch); // do something here if we couldn't scrape the page
    }
    else {
      $info = curl_getinfo($ch);
      echo "Took ".$info['total_time']." seconds for url: ".$info['url'];
      $html->load($output); // Transfer CURL to SIMPLE HTML DOM
    }
    
    /** Free up cURL **/
    curl_close($ch);
    
    // Do something with SIMPLE HTML DOM.  It is well documented and very easy to use.  They have a lot of examples.
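
    As a rough idea of what "doing something" can look like, here is a minimal sketch that pulls the links and the heading out of a page. A made-up HTML string stands in for the cURL $output, and the include path for simple_html_dom.php is an assumption; find(), plaintext, href, and clear() are part of the library's documented interface.

    ```php
    <?php
    include 'simple_html_dom.php'; // assumed path to the Simple HTML DOM library file

    // Made-up markup standing in for the $output returned by curl_exec()
    $output = '<html><body><h1>News</h1><a href="/a">First</a> <a href="/b">Second</a></body></html>';

    $html = new simple_html_dom();
    $html->load($output);

    // find() with a selector returns every matching element
    $links = array();
    foreach ($html->find('a') as $anchor) {
        $links[$anchor->plaintext] = $anchor->href; // link text => link target
    }

    // find() with an index returns a single element (0 is the first match)
    $heading = $html->find('h1', 0)->plaintext;

    print_r($links);
    echo $heading, "\n";

    $html->clear(); // release the memory held by the parsed document
    ```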
    