Issue:
Cannot fully understand the Goutte web scraper.
Request:
Can someone please help me understand it, or provide code to help?
After much trial and error, I discovered a scraper that is much easier to use, well documented, better supported (if you need assistance), and much more effective than Goutte: Simple HTML DOM. If you are having issues with Goutte, try the following:
If you are in the same situation I was in, where the page you are trying to scrape requires a referrer from its own website, you can combine cURL with Simple HTML DOM, because Simple HTML DOM does not appear to be able to send a referrer on its own. If you do not need a referrer, you can use Simple HTML DOM by itself to scrape the page.
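For the case without a referrer, a minimal sketch might look like the one below. It assumes the library file is saved as simple_html_dom.php next to your script, and load_page() is a hypothetical helper name used only for illustration.

```php
<?php
// Assumed location of the Simple HTML DOM library file; adjust the path as needed.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper: fetch and parse a page in one step.
 * Returns the parsed DOM object, or null if the page could not be loaded.
 */
function load_page($source)
{
    // file_get_html() reads the URL (or a local file path) and parses the result.
    $html = file_get_html($source);
    return $html ? $html : null;
}

$url = "http://www.example.com/";
$page = load_page($url);

if ($page === null) {
    echo "Could not load " . $url . "\n";
} else {
    // find() with an index returns a single element, or null if nothing matches.
    $title = $page->find('title', 0);
    echo "Page title: " . ($title ? $title->plaintext : '(none)') . "\n";
}
```

When the site does check the referrer, the page has to be fetched with cURL first, as in the code that follows.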
// Path to the Simple HTML DOM library file (assumed location; adjust as needed)
require_once 'simple_html_dom.php';

$url = "http://www.example.com/sub-page-needs-referer/";
$referer = "http://www.example.com/";
$html = new simple_html_dom(); // Create a new Simple HTML DOM object

/** cURL initialization **/
$ch = curl_init($url); // Passing the URL here already sets CURLOPT_URL

/** Set the cURL options **/
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the response instead of printing it
curl_setopt($ch, CURLOPT_HEADER, false);        // Keep the response headers out of the output
curl_setopt($ch, CURLOPT_REFERER, $referer);    // Send the referrer the site expects

$output = curl_exec($ch);
if ($output === false) {
    echo "cURL Error: " . curl_error($ch); // Do something here if we couldn't scrape the page
} else {
    $info = curl_getinfo($ch);
    echo "Took " . $info['total_time'] . " seconds for url: " . $info['url'];
    $html->load($output); // Hand the cURL response to Simple HTML DOM
}

/** Free up cURL **/
curl_close($ch);

// Do something with Simple HTML DOM. It is well documented and very easy to use. They have a lot of examples.
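As one example of that last step, here is a minimal sketch that collects every link from a parsed page. It assumes the library file is saved as simple_html_dom.php next to your script. extract_links() is a hypothetical helper name, and the sample markup is a stand-in so the sketch runs without a network request. With the cURL code above, you would pass in the $html object it loaded instead.

```php
<?php
// Assumed location of the Simple HTML DOM library file; adjust the path as needed.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper: return the text and href of every <a> element
 * found in a parsed Simple HTML DOM object.
 */
function extract_links($html)
{
    $links = array();
    // find() takes a CSS-style selector and returns the matching elements.
    foreach ($html->find('a') as $anchor) {
        $links[] = array(
            'text' => trim($anchor->plaintext), // Link text without tags
            'href' => $anchor->href,            // Value of the href attribute
        );
    }
    return $links;
}

// Stand-in markup; str_get_html() parses a string instead of fetching a URL.
$sample = '<ul><li><a href="/one">First</a></li><li><a href="/two">Second</a></li></ul>';
$dom = str_get_html($sample);

foreach (extract_links($dom) as $link) {
    echo $link['text'] . ' -> ' . $link['href'] . "\n";
}
```

Keeping the extraction in its own function means the same code works whether the markup came from cURL, from a file, or from a string.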