goutte

How to get meta description content using Goutte

狂风中的少年 · submitted 2019-12-10 19:28:54
Question: Can you please help me find a way to get the content of the meta description, meta keywords, and robots tags using Goutte? Also, how can I target <link rel="stylesheet" href=""> and <script> tags? Below is the PHP I used to get the <title> content:

    require_once 'goutte.phar';

    use Goutte\Client;

    $client = new Client();
    $crawler = $client->request('GET', 'http://stackoverflow.com/');
    $crawler->filter('title')->each(function ($node) {
        // Note: $content must be assigned inside the closure; it does not
        // inherit variables from the outer scope.
        $content = "Title: " . $node->text();
        echo $content;
    });

Here is …
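The same attribute-based XPath expressions can be handed to Goutte's filterXPath(). Since Goutte itself needs a live HTTP request, here is a minimal sketch using PHP's built-in DOMXPath against an invented inline document — the HTML, tag values, and paths below are illustrative only:

```php
<?php
// Sketch: extract meta description/keywords/robots, plus stylesheet and
// script URLs, with XPath. The inline HTML stands in for a fetched page.
$html = <<<HTML
<html><head>
<title>Example</title>
<meta name="description" content="A demo page">
<meta name="keywords" content="php, goutte, scraping">
<meta name="robots" content="index, follow">
<link rel="stylesheet" href="/css/main.css">
<script src="/js/app.js"></script>
</head><body></body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Meta tags: select by the "name" attribute, read the "content" attribute.
$description = $xpath->query('//meta[@name="description"]')->item(0)->getAttribute('content');
$keywords    = $xpath->query('//meta[@name="keywords"]')->item(0)->getAttribute('content');
$robots      = $xpath->query('//meta[@name="robots"]')->item(0)->getAttribute('content');

// Stylesheets and scripts: target the href/src attributes directly.
$stylesheet = $xpath->query('//link[@rel="stylesheet"]')->item(0)->getAttribute('href');
$script     = $xpath->query('//script[@src]')->item(0)->getAttribute('src');

echo "Description: $description\n";
echo "Keywords: $keywords\n";
echo "Robots: $robots\n";
echo "Stylesheet: $stylesheet\n";
echo "Script: $script\n";
```

With a live crawler the equivalent Goutte call would be along the lines of $crawler->filterXPath('//meta[@name="description"]')->attr('content'), since DomCrawler's attr() reads the named attribute of the first matched node.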

How to run PHPUnit from a PHP script?

杀马特。学长 韩版系。学妹 · submitted 2019-12-10 17:25:50
Question: I am creating a custom testing application using PHPUnit and Goutte. I would like to load the Goutte library (plus any files required for the tests) within my own bootstrap file and then start the PHPUnit test runner once everything is bootstrapped. I'm not sure how to do this without calling the phpunit script externally, which would be a separate process and wouldn't be able to see my bootstrapped libraries. Has anyone done anything like this before? What is the best way to do it?

Answer 1: If you …
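One approach is to bootstrap everything yourself and then invoke PHPUnit's runner in the same process with exiting disabled. This is a sketch only, assuming PHPUnit is installed as a project dependency and exposes the TextUI\Command class as PHPUnit 6–8 did; the file paths and directory names are made up:

    <?php
    // Sketch: run PHPUnit from inside a wrapper script, after our own
    // bootstrapping, without spawning a separate phpunit process.
    require __DIR__ . '/vendor/autoload.php'; // loads Goutte, PHPUnit, ...
    require __DIR__ . '/my_bootstrap.php';    // hypothetical custom fixtures

    // Build the argv the phpunit binary would normally receive.
    $argv = ['phpunit', '--no-configuration', __DIR__ . '/tests'];

    // Passing false as the second argument makes run() return the exit
    // code instead of calling exit(), so the wrapper keeps control.
    $command  = new PHPUnit\TextUI\Command();
    $exitCode = $command->run($argv, false);
    echo "PHPUnit finished with exit code $exitCode\n";

Because the suite runs in-process, any classes loaded by the bootstrap files above are visible to the tests.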

Goutte won't load an ASP SSL page

…衆ロ難τιáo~ · submitted 2019-12-08 15:38:22
Question: I am trying out Goutte, the PHP web crawler based on Symfony2 components. I've successfully retrieved Google in both plaintext and SSL forms. However, I've come across an ASP/SSL page that won't load. Here's my code:

    // Load a crawler/browser system
    require_once 'vendor/goutte/goutte.phar';

    // Here's a demo of a page we want to parse
    $uri = '(removed)';

    use Goutte\Client;

    $client = new Client();
    $crawler = $client->request('GET', $uri);
    echo $crawler->text() . "\n";

Instead, the echo at the …

Sending multiple goutte requests asynchronously

半腔热情 · submitted 2019-12-08 08:52:21
Question: This is the code I am using:

    require_once 'goutte.phar';

    use Goutte\Client;

    $client = new Client();
    for ($i = 0; $i < 10; $i++) {
        $crawler = $client->request('GET', 'http://website.com');
        echo '<p>' . $crawler->filterXPath('//meta[@property="og:description"]')->attr('content') . '</p>';
        echo '<p>' . $crawler->filter('title')->text() . '</p>';
    }

This works but takes a lot of time to process. Is there any way to do it faster?

Answer 1: For starters, there is nothing asynchronous about your code sample, which means …

DOMCrawler not dumping data properly for parsing

坚强是说给别人听的谎言 · submitted 2019-12-08 08:21:43
Question: I'm using Symfony, Goutte, and DOMCrawler to scrape a page. Unfortunately, this page has many old-fashioned tables of data, with no IDs, classes, or other identifying factors. So I'm trying to find a table by parsing through the source code I get back from the request, but I can't seem to access any information. I think that when I try to filter it, it only filters the first node, and that's not where my desired data is, so it returns nothing. So I have a $crawler object, and I've tried to loop through …
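When tables carry no ids or classes, positional XPath still works. Below is a self-contained sketch using PHP's built-in DOMXPath — the same expressions can be passed to DomCrawler's filterXPath() — where the HTML is an invented stand-in for the scraped page:

```php
<?php
// Sketch: address anonymous tables by position and walk their rows.
$html = '<html><body>
<table><tr><td>skip</td></tr></table>
<table><tr><td>Alice</td><td>30</td></tr><tr><td>Bob</td><td>25</td></tr></table>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath indexes from 1: (//table)[2] is the second table in the document,
// regardless of nesting, so we are not limited to the first node.
$rows = $xpath->query('(//table)[2]//tr');

$data = [];
foreach ($rows as $row) {
    $cells = [];
    // Relative query: 'td' is evaluated against the current row node.
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = $cell->textContent;
    }
    $data[] = $cells;
}
print_r($data);
// $data === [['Alice', '30'], ['Bob', '25']]
```

The same positional trick works in Goutte: $crawler->filterXPath('(//table)[2]//tr')->each(...) iterates every matched row, not just the first.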

Setting CURL Parameters for fabpot/goutte Client

大憨熊 · submitted 2019-12-07 08:30:38
Question: I am working on a web crawler using Goutte (fabpot/goutte). When I try to connect to an HTTPS site, it throws an error because the site is using a self-signed certificate. I am trying to find a way to set the cURL parameters to ignore the fact that the SSL certificate is self-signed. Following the instructions at https://github.com/FriendsOfPHP/Goutte I tried the following code:

    $this->client = new Client();
    $this->client->getClient()->setDefaultOption('config/curl/' . CURLOPT_SSL_VERIFYPEER, …

Sending multiple goutte requests asynchronously

南楼画角 · submitted 2019-12-06 16:16:25
This is the code I am using:

    require_once 'goutte.phar';

    use Goutte\Client;

    $client = new Client();
    for ($i = 0; $i < 10; $i++) {
        $crawler = $client->request('GET', 'http://website.com');
        echo '<p>' . $crawler->filterXPath('//meta[@property="og:description"]')->attr('content') . '</p>';
        echo '<p>' . $crawler->filter('title')->text() . '</p>';
    }

This works but takes a lot of time to process. Is there any way to do it faster?

For starters, there is nothing asynchronous about your code sample, which means that your application will, sequentially, perform a GET request, wait for the response, parse the response, and …
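To actually overlap the HTTP round-trips you have to step outside Goutte, whose request() is synchronous. One common pattern — sketched here assuming the Guzzle 6 promise API that Goutte wraps, with a placeholder URL — is to fire every request with getAsync() and then parse the settled responses with DomCrawler:

    <?php
    // Sketch: issue ten requests concurrently instead of one at a time.
    require __DIR__ . '/vendor/autoload.php';

    use GuzzleHttp\Client;
    use GuzzleHttp\Promise;
    use Symfony\Component\DomCrawler\Crawler;

    $guzzle = new Client();

    // Fire all requests before waiting on any of them.
    $promises = [];
    for ($i = 0; $i < 10; $i++) {
        $promises[$i] = $guzzle->getAsync('http://website.com');
    }

    // settle() waits for every promise, fulfilled or rejected alike.
    $results = Promise\settle($promises)->wait();

    foreach ($results as $result) {
        if ($result['state'] !== 'fulfilled') {
            continue; // a request failed; skip it
        }
        // Parse each body with DomCrawler, as Goutte would internally.
        $crawler = new Crawler((string) $result['value']->getBody());
        echo '<p>' . $crawler->filter('title')->text() . '</p>';
    }

The total wall-clock time then approaches that of the slowest single request rather than the sum of all ten.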

How to send Custom Headers using PHP Goutte

ぐ巨炮叔叔 · submitted 2019-12-06 14:24:12
Question: I am trying to scrape a site that actively blocks bots. I have this code in PHP cURL to get around the blocking:

    $headers = array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding: gzip, deflate, sdch',
        'Accept-Language: en-US,en;q=0.8',
        'Cache-Control: max-age=0',
        'User-Agent: ' . $user_agents[array_rand($user_agents)]
    );
    curl_setopt($curl_init, CURLOPT_URL, $url);
    curl_setopt($curl_init, CURLOPT_HTTPHEADER, $headers);
    $output = curl_exec(…

Setting CURL Parameters for fabpot/goutte Client

限于喜欢 · submitted 2019-12-05 12:55:58
I am working on a web crawler using Goutte (fabpot/goutte). When I try to connect to an HTTPS site, it throws an error because the site is using a self-signed certificate. I am trying to find a way to set the cURL parameters to ignore the fact that the SSL certificate is self-signed. Following the instructions at https://github.com/FriendsOfPHP/Goutte I tried the following code:

    $this->client = new Client();
    $this->client->getClient()->setDefaultOption('config/curl/' . CURLOPT_SSL_VERIFYPEER, false);
    $this->client->getClient()->setDefaultOption('config/curl/' . CURLOPT_CERTINFO, false);
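setDefaultOption() with 'config/curl/...' keys belongs to an older Guzzle API, which is a likely reason the snippet above fails. With a Guzzle 6-era Goutte, a commonly used alternative — sketched here; whether it applies depends on your installed versions — is to hand Goutte a Guzzle client constructed with 'verify' => false:

    <?php
    // Sketch: disable SSL-certificate verification for self-signed hosts.
    // Assumes fabpot/goutte 3.x wrapping Guzzle 6 (version-dependent API).
    require __DIR__ . '/vendor/autoload.php';

    use Goutte\Client as GoutteClient;
    use GuzzleHttp\Client as GuzzleClient;

    $goutte = new GoutteClient();

    // 'verify' => false tells Guzzle (and thus cURL) to skip peer
    // verification. Only do this for hosts you control; it defeats
    // the point of TLS.
    $goutte->setClient(new GuzzleClient(['verify' => false]));

    $crawler = $goutte->request('GET', 'https://self-signed.example/');

The hostname above is a placeholder; substitute the site you are crawling.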

How to send Custom Headers using PHP Goutte

守給你的承諾、 · submitted 2019-12-04 19:28:23
I am trying to scrape a site that actively blocks bots. I have this code in PHP cURL to get around the blocking:

    $headers = array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding: gzip, deflate, sdch',
        'Accept-Language: en-US,en;q=0.8',
        'Cache-Control: max-age=0',
        'User-Agent: ' . $user_agents[array_rand($user_agents)]
    );
    curl_setopt($curl_init, CURLOPT_URL, $url);
    curl_setopt($curl_init, CURLOPT_HTTPHEADER, $headers);
    $output = curl_exec($curl_init);

It works well. But I am using PHP Goutte, and I want to generate the same request using this library …
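Goutte's Client keeps a set of default headers that are sent with every request, so the cURL header array translates fairly directly. A sketch, assuming Goutte 3.x where Client exposes setHeader(); the $user_agents pool and $url are stand-ins for the question's own variables:

    <?php
    // Sketch: send the same custom headers through Goutte instead of cURL.
    require __DIR__ . '/vendor/autoload.php';

    use Goutte\Client;

    $user_agents = ['Mozilla/5.0 (X11; Linux x86_64)']; // illustrative pool
    $url = 'http://example.com/';                       // placeholder

    $client = new Client();
    // setHeader() registers a header for all subsequent requests.
    $client->setHeader('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8');
    $client->setHeader('Accept-Language', 'en-US,en;q=0.8');
    $client->setHeader('Cache-Control', 'max-age=0');
    $client->setHeader('User-Agent', $user_agents[array_rand($user_agents)]);

    $crawler = $client->request('GET', $url);

Rotating the User-Agent per request, as the cURL version does, just means calling setHeader('User-Agent', ...) again before each request().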