goutte

How to get meta description content using Goutte

狂风中的少年 · submitted 2019-12-10 19:28:54
Question: Can you please help me find a way to get the content of the meta description, meta keywords, and robots tags using Goutte? Also, how can I target <link rel="stylesheet" href=""> and <script> tags? Below is the PHP I used to get the <title> content:

    require_once 'goutte.phar';

    use Goutte\Client;

    $client = new Client();
    $crawler = $client->request('GET', 'http://stackoverflow.com/');
    $crawler->filter('title')->each(function ($node) {
        // Note: $content must be assigned inside the closure; it does not
        // inherit variables from the outer scope.
        $content = "Title: " . $node->text();
        echo $content;
    });

Here is …
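The same attribute-based XPath expressions can be handed to Goutte's filterXPath(). Since Goutte itself needs a live HTTP request, here is a minimal sketch using PHP's built-in DOMXPath against an invented inline document — the HTML, tag values, and paths below are illustrative only:

```php
<?php
// Sketch: extract meta description/keywords/robots, plus stylesheet and
// script URLs, with XPath. The inline HTML stands in for a fetched page.
$html = <<<HTML
<html><head>
<title>Example</title>
<meta name="description" content="A demo page">
<meta name="keywords" content="php, goutte, scraping">
<meta name="robots" content="index, follow">
<link rel="stylesheet" href="/css/main.css">
<script src="/js/app.js"></script>
</head><body></body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Meta tags: select by the "name" attribute, read the "content" attribute.
$description = $xpath->query('//meta[@name="description"]')->item(0)->getAttribute('content');
$keywords    = $xpath->query('//meta[@name="keywords"]')->item(0)->getAttribute('content');
$robots      = $xpath->query('//meta[@name="robots"]')->item(0)->getAttribute('content');

// Stylesheets and scripts: target the href/src attributes directly.
$stylesheet = $xpath->query('//link[@rel="stylesheet"]')->item(0)->getAttribute('href');
$script     = $xpath->query('//script[@src]')->item(0)->getAttribute('src');

echo "Description: $description\n";
echo "Keywords: $keywords\n";
echo "Robots: $robots\n";
echo "Stylesheet: $stylesheet\n";
echo "Script: $script\n";
```

With a live crawler the equivalent Goutte call would be along the lines of $crawler->filterXPath('//meta[@name="description"]')->attr('content'), since DomCrawler's attr() reads the named attribute of the first matched node.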

How to run PHPUnit from a PHP script?

杀马特。学长 韩版系。学妹 · submitted 2019-12-10 17:25:50
Question: I am creating a custom testing application using PHPUnit and Goutte. I would like to load the Goutte library (plus any files required for the tests) within my own bootstrap file and then start the PHPUnit test runner once everything is bootstrapped. I'm not sure how to do this without calling the phpunit script externally, which would be a separate process and wouldn't be able to see my bootstrapped libraries. Has anyone done anything like this before? What is the best way to do it?

Answer 1: If you …
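One approach is to bootstrap everything yourself and then invoke PHPUnit's runner in the same process with exiting disabled. This is a sketch only, assuming PHPUnit is installed as a project dependency and exposes the TextUI\Command class as PHPUnit 6–8 did; the file paths and directory names are made up:

    <?php
    // Sketch: run PHPUnit from inside a wrapper script, after our own
    // bootstrapping, without spawning a separate phpunit process.
    require __DIR__ . '/vendor/autoload.php'; // loads Goutte, PHPUnit, ...
    require __DIR__ . '/my_bootstrap.php';    // hypothetical custom fixtures

    // Build the argv the phpunit binary would normally receive.
    $argv = ['phpunit', '--no-configuration', __DIR__ . '/tests'];

    // Passing false as the second argument makes run() return the exit
    // code instead of calling exit(), so the wrapper keeps control.
    $command  = new PHPUnit\TextUI\Command();
    $exitCode = $command->run($argv, false);
    echo "PHPUnit finished with exit code $exitCode\n";

Because the suite runs in-process, any classes loaded by the bootstrap files above are visible to the tests.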

Goutte won't load an ASP SSL page

…衆ロ難τιáo~ · submitted 2019-12-08 15:38:22
Question: I am trying out Goutte, the PHP web crawler based on Symfony2 components. I've successfully retrieved Google in both plaintext and SSL forms. However, I've come across an ASP/SSL page that won't load. Here's my code:

    // Load a crawler/browser system
    require_once 'vendor/goutte/goutte.phar';

    // Here's a demo of a page we want to parse
    $uri = '(removed)';

    use Goutte\Client;

    $client = new Client();
    $crawler = $client->request('GET', $uri);
    echo $crawler->text() . "\n";

Instead, the echo at the …

Sending multiple goutte requests asynchronously

半腔热情 · submitted 2019-12-08 08:52:21
Question: This is the code I am using:

    require_once 'goutte.phar';

    use Goutte\Client;

    $client = new Client();
    for ($i = 0; $i < 10; $i++) {
        $crawler = $client->request('GET', 'http://website.com');
        echo '<p>' . $crawler->filterXPath('//meta[@property="og:description"]')->attr('content') . '</p>';
        echo '<p>' . $crawler->filter('title')->text() . '</p>';
    }

This works but takes a lot of time to process. Is there any way to do it faster?

Answer 1: For starters, there is nothing asynchronous about your code sample, which means …

DOMCrawler not dumping data properly for parsing

坚强是说给别人听的谎言 · submitted 2019-12-08 08:21:43
Question: I'm using Symfony, Goutte, and DOMCrawler to scrape a page. Unfortunately, this page has many old-fashioned tables of data, with no IDs, classes, or other identifying factors. So I'm trying to find a table by parsing through the source code I get back from the request, but I can't seem to access any information. I think that when I try to filter it, it only filters the first node, and that's not where my desired data is, so it returns nothing. So I have a $crawler object, and I've tried to loop through …
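When tables carry no ids or classes, positional XPath still works. Below is a self-contained sketch using PHP's built-in DOMXPath — the same expressions can be passed to DomCrawler's filterXPath() — where the HTML is an invented stand-in for the scraped page:

```php
<?php
// Sketch: address anonymous tables by position and walk their rows.
$html = '<html><body>
<table><tr><td>skip</td></tr></table>
<table><tr><td>Alice</td><td>30</td></tr><tr><td>Bob</td><td>25</td></tr></table>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath indexes from 1: (//table)[2] is the second table in the document,
// regardless of nesting, so we are not limited to the first node.
$rows = $xpath->query('(//table)[2]//tr');

$data = [];
foreach ($rows as $row) {
    $cells = [];
    // Relative query: 'td' is evaluated against the current row node.
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = $cell->textContent;
    }
    $data[] = $cells;
}
print_r($data);
// $data === [['Alice', '30'], ['Bob', '25']]
```

The same positional trick works in Goutte: $crawler->filterXPath('(//table)[2]//tr')->each(...) iterates every matched row, not just the first.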

Setting CURL Parameters for fabpot/goutte Client

大憨熊 · submitted 2019-12-07 08:30:38
Question: I am working on a web crawler using Goutte (fabpot/goutte). When I try to connect to an HTTPS site, it throws an error because the site is using a self-signed certificate. I am trying to find a way to set the cURL parameters to ignore the fact that the SSL certificate is self-signed. Following the instructions at https://github.com/FriendsOfPHP/Goutte I tried the following code:

    $this->client = new Client();
    $this->client->getClient()->setDefaultOption('config/curl/' . CURLOPT_SSL_VERIFYPEER, …

Sending multiple goutte requests asynchronously

南楼画角 · submitted 2019-12-06 16:16:25
This is the code I am using:

    require_once 'goutte.phar';

    use Goutte\Client;

    $client = new Client();
    for ($i = 0; $i < 10; $i++) {
        $crawler = $client->request('GET', 'http://website.com');
        echo '<p>' . $crawler->filterXPath('//meta[@property="og:description"]')->attr('content') . '</p>';
        echo '<p>' . $crawler->filter('title')->text() . '</p>';
    }

This works but takes a lot of time to process. Is there any way to do it faster?

For starters, there is nothing asynchronous about your code sample, which means that your application will, sequentially, perform a GET request, wait for the response, parse the response, and …
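To actually overlap the HTTP round-trips you have to step outside Goutte, whose request() is synchronous. One common pattern — sketched here assuming the Guzzle 6 promise API that Goutte wraps, with a placeholder URL — is to fire every request with getAsync() and then parse the settled responses with DomCrawler:

    <?php
    // Sketch: issue ten requests concurrently instead of one at a time.
    require __DIR__ . '/vendor/autoload.php';

    use GuzzleHttp\Client;
    use GuzzleHttp\Promise;
    use Symfony\Component\DomCrawler\Crawler;

    $guzzle = new Client();

    // Fire all requests before waiting on any of them.
    $promises = [];
    for ($i = 0; $i < 10; $i++) {
        $promises[$i] = $guzzle->getAsync('http://website.com');
    }

    // settle() waits for every promise, fulfilled or rejected alike.
    $results = Promise\settle($promises)->wait();

    foreach ($results as $result) {
        if ($result['state'] !== 'fulfilled') {
            continue; // a request failed; skip it
        }
        // Parse each body with DomCrawler, as Goutte would internally.
        $crawler = new Crawler((string) $result['value']->getBody());
        echo '<p>' . $crawler->filter('title')->text() . '</p>';
    }

The total wall-clock time then approaches that of the slowest single request rather than the sum of all ten.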

How to send Custom Headers using PHP Goutte

ぐ巨炮叔叔 · submitted 2019-12-06 14:24:12
Question: I am trying to scrape a site that actively blocks bots. I have this code in PHP cURL to get around the blocking:

    $headers = array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding: gzip, deflate, sdch',
        'Accept-Language: en-US,en;q=0.8',
        'Cache-Control: max-age=0',
        'User-Agent: ' . $user_agents[array_rand($user_agents)]
    );
    curl_setopt($curl_init, CURLOPT_URL, $url);
    curl_setopt($curl_init, CURLOPT_HTTPHEADER, $headers);
    $output = curl_exec(…

Setting CURL Parameters for fabpot/goutte Client

限于喜欢 · submitted 2019-12-05 12:55:58
I am working on a web crawler using Goutte (fabpot/goutte). When I try to connect to an HTTPS site, it throws an error because the site is using a self-signed certificate. I am trying to find a way to set the cURL parameters to ignore the fact that the SSL certificate is self-signed. Following the instructions at https://github.com/FriendsOfPHP/Goutte I tried the following code:

    $this->client = new Client();
    $this->client->getClient()->setDefaultOption('config/curl/' . CURLOPT_SSL_VERIFYPEER, false);
    $this->client->getClient()->setDefaultOption('config/curl/' . CURLOPT_CERTINFO, false);
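setDefaultOption() with 'config/curl/...' keys belongs to an older Guzzle API, which is a likely reason the snippet above fails. With a Guzzle 6-era Goutte, a commonly used alternative — sketched here; whether it applies depends on your installed versions — is to hand Goutte a Guzzle client constructed with 'verify' => false:

    <?php
    // Sketch: disable SSL-certificate verification for self-signed hosts.
    // Assumes fabpot/goutte 3.x wrapping Guzzle 6 (version-dependent API).
    require __DIR__ . '/vendor/autoload.php';

    use Goutte\Client as GoutteClient;
    use GuzzleHttp\Client as GuzzleClient;

    $goutte = new GoutteClient();

    // 'verify' => false tells Guzzle (and thus cURL) to skip peer
    // verification. Only do this for hosts you control; it defeats
    // the point of TLS.
    $goutte->setClient(new GuzzleClient(['verify' => false]));

    $crawler = $goutte->request('GET', 'https://self-signed.example/');

The hostname above is a placeholder; substitute the site you are crawling.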

How to send Custom Headers using PHP Goutte

守給你的承諾、 · submitted 2019-12-04 19:28:23
I am trying to scrape a site that actively blocks bots. I have this code in PHP cURL to get around the blocking:

    $headers = array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding: gzip, deflate, sdch',
        'Accept-Language: en-US,en;q=0.8',
        'Cache-Control: max-age=0',
        'User-Agent: ' . $user_agents[array_rand($user_agents)]
    );
    curl_setopt($curl_init, CURLOPT_URL, $url);
    curl_setopt($curl_init, CURLOPT_HTTPHEADER, $headers);
    $output = curl_exec($curl_init);

It works well. But I am using PHP Goutte, and I want to generate the same request using this library …
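Goutte's Client keeps a set of default headers that are sent with every request, so the cURL header array translates fairly directly. A sketch, assuming Goutte 3.x where Client exposes setHeader(); the $user_agents pool and $url are stand-ins for the question's own variables:

    <?php
    // Sketch: send the same custom headers through Goutte instead of cURL.
    require __DIR__ . '/vendor/autoload.php';

    use Goutte\Client;

    $user_agents = ['Mozilla/5.0 (X11; Linux x86_64)']; // illustrative pool
    $url = 'http://example.com/';                       // placeholder

    $client = new Client();
    // setHeader() registers a header for all subsequent requests.
    $client->setHeader('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8');
    $client->setHeader('Accept-Language', 'en-US,en;q=0.8');
    $client->setHeader('Cache-Control', 'max-age=0');
    $client->setHeader('User-Agent', $user_agents[array_rand($user_agents)]);

    $crawler = $client->request('GET', $url);

Rotating the User-Agent per request, as the cURL version does, just means calling setHeader('User-Agent', ...) again before each request().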