Goutte won't load an ASP SSL page

Submitted by …衆ロ難τιáo~ on 2019-12-08 15:38:22

Question


I am trying out Goutte, the PHP web crawler based on Symfony2 components. I've successfully retrieved Google over both plain HTTP and SSL. However, I've come across an ASP/SSL page that won't load.

Here's my code:

// Load a crawler/browser system
require_once 'vendor/goutte/goutte.phar';

// Here's a demo of a page we want to parse
$uri = '(removed)';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";

For this one site, however, the echo at the end of the code above prints this instead of the page text:

Bad Request (Invalid Header Name)

I can see the site fine in Firefox, and its HTML can be retrieved using wget --no-check-certificate with no other options (without setting any headers or a user agent, for example).

I suspect I need to set some HTTP headers in Goutte. Does anyone have any ideas about which ones I should try?


Answer 1:


I discovered that my browser and wget both add a non-empty user agent field in the header, so I am assuming Goutte sets nothing here. Adding this header to the browser object prior to the fetch fixes the problem:

// Load a crawler/browser system
require_once 'vendor/goutte/goutte.phar';

// Here's a demo of a page we want to parse
$uri = '(removed)';

use Goutte\Client;

// Set up headers
$client = new Client();
$headers = array(
    'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:21.0) Gecko/20100101 Firefox/21.0',
);
foreach ($headers as $header => $value)
{
    $client->setHeader($header, $value);
}

$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";

Here I've copied in my browser's user agent string, but in this case I think any value would work, as long as one is set.

Incidentally, I used a browser UA here as I was trying to accurately replicate the browser environment for debugging this particular problem. Once it worked I switched to a custom UA, so target sites can detect it as a bot if they wish to (for this project I don't think anyone has).
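
As a minimal sketch of what such a custom UA could look like: the helper below builds a string in the common "name/version (+info URL)" shape that many crawlers use to identify themselves. The function name `buildBotUserAgent`, the crawler name, the version and the URL are all hypothetical placeholders, not anything taken from Goutte or from the original project.

```php
<?php
// Hypothetical helper: builds a bot-identifying User-Agent value.
// The name, version and info URL passed in below are placeholders.
function buildBotUserAgent(string $name, string $version, string $infoUrl): string
{
    // Header values must not contain line breaks, so collapse any CR/LF to a space.
    $clean = static function (string $part): string {
        return trim(preg_replace('/[\r\n]+/', ' ', $part));
    };

    return sprintf('%s/%s (+%s)', $clean($name), $clean($version), $clean($infoUrl));
}

$ua = buildBotUserAgent('ExampleCrawler', '0.1', 'https://example.com/bot-info');
echo $ua . "\n";

// With Goutte loaded as in the snippet above, the value would be passed on
// in the same way as the browser string:
// $client->setHeader('User-Agent', $ua);
```

Pointing the URL at a page that describes the crawler gives site owners a way to find out what is hitting their server, which is the usual reason for this format.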




Answer 2:


I had this problem too.

Adding the User-Agent header was not enough. I also added HTTP_USER_AGENT using the setServerParameter function, and it worked like a charm.

Here's the complete code:

// Load a crawler/browser system
require_once 'vendor/goutte/goutte.phar';

// Here's a demo of a page we want to parse
$uri = '(removed)';
$ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:21.0) Gecko/20100101 Firefox/21.0';

use Goutte\Client;

// Set up headers
$client = new Client();
$client->setHeader('User-Agent', $ua);
$client->setServerParameter('HTTP_USER_AGENT', $ua);

$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";
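
A possible explanation for why the server parameter matters: Symfony's BrowserKit, which Goutte builds on, accepts request headers as CGI-style server parameters, where a header such as User-Agent is written HTTP_USER_AGENT. The sketch below only illustrates that naming convention. `serverParamsToHeaders` is a hypothetical helper, not Goutte's or BrowserKit's actual code, and which of the two mechanisms a given Goutte version honours is an assumption worth checking against its source.

```php
<?php
// Hypothetical helper illustrating the CGI-style naming convention.
// This is NOT Goutte's or BrowserKit's implementation.
function serverParamsToHeaders(array $server): array
{
    $headers = [];
    foreach ($server as $key => $value) {
        // Only parameters prefixed with HTTP_ carry request headers.
        if (strpos($key, 'HTTP_') !== 0) {
            continue;
        }
        // HTTP_USER_AGENT -> "USER AGENT" -> "User Agent" -> "User-Agent"
        $words = ucwords(strtolower(str_replace('_', ' ', substr($key, 5))));
        $headers[str_replace(' ', '-', $words)] = $value;
    }

    return $headers;
}

$headers = serverParamsToHeaders([
    'HTTP_USER_AGENT'      => 'ExampleCrawler/0.1', // placeholder value
    'HTTP_ACCEPT_LANGUAGE' => 'en',
    'HTTPS'                => 'on',                 // no HTTP_ prefix, so skipped
]);
print_r($headers);
```

If a client derives its outgoing headers from server parameters in this way, a default HTTP_USER_AGENT entry could take part in deciding what is finally sent, which would be consistent with setting both values as in the code above.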


Source: https://stackoverflow.com/questions/17180837/goutte-wont-load-an-asp-ssl-page
