How to scrape an SSL or HTTPS URL

隐瞒了意图╮ 2020-12-12 03:33

I have written a function to scrape a website using cURL, but it returns nothing when called and I can't understand why. The output is empty.

  

        
1 answer
  • 2020-12-12 03:37

    There are two possible fixes when scraping an SSL (HTTPS) URL:

    1. The quick fix
    2. The proper fix

    The quick fix, first.

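    Warning: this turns off certificate verification, so cURL will accept any certificate the server presents. That reintroduces the security issues SSL is designed to protect against, such as man-in-the-middle attacks.

    Set: CURLOPT_SSL_VERIFYPEER => false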
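
    As a rough illustration of the quick fix, the sketch below builds a cURL options array with peer verification switched off. The helper names insecure_curl_options and fetch_insecure are hypothetical and not part of the original answer; treat this as a debugging aid only, not something to ship.

    ```php
    <?php
    // Hypothetical helper (not from the original answer): cURL options with
    // certificate verification disabled. For local debugging only.
    function insecure_curl_options(string $url): array
    {
        return [
            CURLOPT_URL            => $url,
            CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
            // The quick fix: do not verify the server certificate.
            // This removes protection against man-in-the-middle attacks.
            CURLOPT_SSL_VERIFYPEER => false,
        ];
    }

    // Hypothetical helper: fetches $url with the options above and
    // throws if cURL reports a failure.
    function fetch_insecure(string $url): string
    {
        $handle = curl_init();
        curl_setopt_array($handle, insecure_curl_options($url));
        $body = curl_exec($handle);
        if ($body === false) {
            throw new RuntimeException('Curl error: ' . curl_error($handle));
        }
        return $body;
    }
    ```

    Calling echo fetch_insecure("https://www.google.com"); should then print the page body, assuming the only problem was certificate verification.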

    The second, and proper, fix is to set three options:

    1. CURLOPT_SSL_VERIFYPEER => true
    2. CURLOPT_SSL_VERIFYHOST => 2
    3. CURLOPT_CAINFO => getcwd() . DIRECTORY_SEPARATOR . 'CAcert.pem'

    The last thing you need to do is download the CA certificate bundle.

    Go to http://curl.haxx.se/docs/caextract.html, click 'cacert.pem', copy and paste the text into a text editor, and save the file as 'CAcert.pem' in the script's working directory, since that is where getcwd() tells cURL to look. Check that it was not saved as 'CAcert.pem.txt'.
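
    A wrongly named bundle is easy to miss, so a small check before the request can help. The sketch below is an assumption on my part rather than part of the original answer: find_ca_bundle is a hypothetical helper that returns the bundle path if the file is there, and otherwise throws with a hint, including the stray '.txt' case.

    ```php
    <?php
    // Hypothetical helper (not from the original answer): returns the path to
    // the CA bundle inside $dir, or throws with a hint about what went wrong.
    function find_ca_bundle(string $dir, string $name = 'CAcert.pem'): string
    {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        if (is_file($path) && is_readable($path)) {
            return $path;
        }
        // Common mistake: the text editor appended ".txt" when saving.
        if (is_file($path . '.txt')) {
            throw new RuntimeException("Found {$name}.txt; rename it to {$name}");
        }
        throw new RuntimeException("CA bundle not found at {$path}");
    }
    ```

    The result could then be passed straight to cURL, for example CURLOPT_CAINFO => find_ca_bundle(getcwd()).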

    <?php
        function scrape($url)
        {
            // Browser-like request headers
            $headers = array(
                "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
                "Cache-Control: max-age=0",
                "Connection: keep-alive",
                "Keep-Alive: 300",
                "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
                "Accept-Language: en-us,en;q=0.5",
                "Pragma: "
            );
            $config = array(
                // Verify the server certificate against the downloaded CA bundle
                CURLOPT_SSL_VERIFYPEER => true,
                CURLOPT_SSL_VERIFYHOST => 2,
                // DIRECTORY_SEPARATOR keeps the path valid on Windows and Unix
                CURLOPT_CAINFO => getcwd() . DIRECTORY_SEPARATOR . 'CAcert.pem',
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_AUTOREFERER => true,
                CURLOPT_CONNECTTIMEOUT => 120,
                CURLOPT_TIMEOUT => 120,
                CURLOPT_MAXREDIRS => 10,
                CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",
                CURLOPT_URL => $url
            );
            $handle = curl_init();
            curl_setopt_array($handle, $config);
            curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);

            // Create the result object explicitly; assigning a property on an
            // undefined variable is an error in PHP 8
            $output = new stdClass();

            // Run the request once and reuse the result for the error check
            $output->data = curl_exec($handle);

            if ($output->data === false) {
                $output->error = 'Curl error: ' . curl_error($handle);
            } else {
                $output->error = 'Operation completed without any errors';
            }

            curl_close($handle);
            return $output;
        }

    $scrape = scrape("https://www.google.com");

    echo $scrape->data;

    // Uncomment to see the error message
    //echo $scrape->error;
    ?>
    