How to scrape an SSL or HTTPS URL


I have written a function to scrape a website using cURL, but it returns nothing when called and I can't understand why. The output is empty.


                      
1 Answer

走了就别回头了
2020-12-12 03:37

There are two possible fixes when scraping an SSL or HTTPS URL:


1. The quick fix
2. The proper fix


The quick fix, first.

Warning: this disables certificate verification, so it removes the protection against man-in-the-middle attacks that SSL is designed to provide.

Set: CURLOPT_SSL_VERIFYPEER => false

The second, and proper, fix. Set three options:


CURLOPT_SSL_VERIFYPEER => true
CURLOPT_SSL_VERIFYHOST => 2
CURLOPT_CAINFO => getcwd() . '/CAcert.pem'


The last thing you need to do is download the CA certificate bundle.

Go to http://curl.haxx.se/docs/caextract.html, click 'cacert.pem', copy and paste the text into a text editor, and save the file as 'CAcert.pem' in the script's working directory (the directory returned by getcwd()). Check that it isn't saved as 'CAcert.pem.txt'.
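A missing or misnamed bundle is one way the proper fix can still fail, so it may help to check the file before passing it to cURL. The following is only a minimal sketch: the helper name `find_ca_bundle` and the PEM-header check are illustrative assumptions, not part of the original answer.

```php
<?php
/**
 * Return the path to the CA bundle in $dir, or throw if it looks unusable.
 * Hypothetical helper: the name and the checks are illustrative only.
 */
function find_ca_bundle(string $dir, string $name = 'CAcert.pem'): string
{
    $path = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;

    if (!is_file($path)) {
        // Catch the common mistake of the editor appending ".txt".
        if (is_file($path . '.txt')) {
            throw new RuntimeException("Found $name.txt; rename it to $name");
        }
        throw new RuntimeException("CA bundle not found at $path");
    }

    // A PEM bundle contains at least one certificate header line.
    $contents = file_get_contents($path);
    if ($contents === false || strpos($contents, '-----BEGIN CERTIFICATE-----') === false) {
        throw new RuntimeException("$path does not look like a PEM certificate bundle");
    }

    return $path;
}
```

With a helper like this, the option could be written as CURLOPT_CAINFO => find_ca_bundle(getcwd()), so a bad file is reported before the request is made.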

<?php
    function scrape($url)
    {
        $headers = Array(
                    "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
                    "Cache-Control: max-age=0",
                    "Connection: keep-alive",
                    "Keep-Alive: 300",
                    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
                    "Accept-Language: en-us,en;q=0.5",
                    "Pragma: "
                );
        $config = Array(
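                        // Verify the server certificate against the downloaded CA bundle
                        CURLOPT_SSL_VERIFYPEER => true,
                        CURLOPT_SSL_VERIFYHOST => 2,
                        // A forward slash works on both Windows and Unix
                        CURLOPT_CAINFO => getcwd() . '/CAcert.pem',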
                        CURLOPT_RETURNTRANSFER => TRUE ,
                        CURLOPT_FOLLOWLOCATION => TRUE ,
                        CURLOPT_AUTOREFERER => TRUE ,
                        CURLOPT_CONNECTTIMEOUT => 120 ,
                        CURLOPT_TIMEOUT => 120 ,
                        CURLOPT_MAXREDIRS => 10 ,                   
                        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8" ,
                        CURLOPT_URL => $url
                       ) ;
        $handle = curl_init() ;
        curl_setopt_array($handle,$config) ;
        curl_setopt($handle,CURLOPT_HTTPHEADER,$headers) ;
        // Create the result object explicitly and run the request only once
        $output = new stdClass() ;
        $output->data = curl_exec($handle) ;

        if($output->data === false) {
            $output->error = 'Curl error: ' . curl_error($handle);
        } else {
            $output->error = 'Operation completed without any errors';
        }

        curl_close($handle) ;
        return $output ;
    }

$scrape = scrape("https://www.google.com") ;

echo $scrape->data;

//uncomment for errors
//echo $scrape->error;
?>
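Since the original problem was an empty result, reporting the cURL error number and the HTTP status next to the body usually shows why nothing came back. The sketch below is one possible way to do that, not the answer's own method: `describe_curl_result` and `fetch_with_diagnostics` are hypothetical names, and the `$caFile` parameter is assumed to point at the CA bundle saved earlier.

```php
<?php
// Hypothetical helper: turns the raw cURL outcome into a readable message.
function describe_curl_result($body, int $errno, string $error, int $status): string
{
    if ($body === false) {
        return "cURL error $errno: $error";
    }
    if ($status >= 400) {
        return "HTTP $status returned by server";
    }
    if ($body === '') {
        return "HTTP $status with an empty body";
    }
    return "HTTP $status, " . strlen($body) . " bytes received";
}

// Hypothetical wrapper: performs one request and reports what happened.
function fetch_with_diagnostics(string $url, string $caFile): array
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_SSL_VERIFYPEER => true,
        CURLOPT_SSL_VERIFYHOST => 2,
        CURLOPT_CAINFO         => $caFile,
    ]);

    $body   = curl_exec($handle);
    $report = describe_curl_result(
        $body,
        curl_errno($handle),
        curl_error($handle),
        (int) curl_getinfo($handle, CURLINFO_HTTP_CODE)
    );
    curl_close($handle);

    return ['body' => $body, 'report' => $report];
}
```

Echoing the 'report' entry first makes it clear whether the empty output comes from a certificate failure, an HTTP error, or a genuinely empty response.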

                                                                        
                                                        
            
            
              
                