Strange CURL issue with a particular website SSL certificate

醉话见心 2021-01-26 04:30

I am trying to use CURL to fetch web pages from a particular website, but it gives this error:

curl -q -v -A \"Mozilla/5.0 (compatible; Googlebot/2.1; +http://ww         


        
1 Answer
  •  温柔的废话
    2021-01-26 05:08

    The problem is not this site's certificate. The debug output clearly shows that the TLS handshake completes successfully, and once the handshake is done the certificate no longer matters.

    However, the DNS records show that www.saiglobal.com is served through the Akamai CDN, and Akamai provides some kind of bot detection:

    $ dig www.saiglobal.com
    ...
    www.saiglobal.com.      45      IN      CNAME   www.saiglobal.com.edgekey.net.
    www.saiglobal.com.edgekey.net. 62 IN    CNAME   e9158.a.akamaiedge.net.
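
The same check can be scripted. The following is a minimal sketch, assuming that a CNAME target under edgekey.net or akamaiedge.net (the two domains visible in the output above) is a good enough hint for Akamai; the function name looks_like_akamai is hypothetical and not part of dig or any other tool.

```shell
#!/bin/sh
# Hypothetical helper: reads DNS answer text (e.g. dig output) on stdin and
# succeeds if any CNAME record points below an Akamai edge domain.
# Assumption: only the two domains seen in the dig output above are checked.
looks_like_akamai() {
    grep -Eiq 'CNAME[[:space:]]+[^[:space:]]*\.(edgekey|akamaiedge)\.net\.?'
}
```

It could be used as: dig www.saiglobal.com | looks_like_akamai && echo "probably behind Akamai". A match only indicates the CDN, not whether bot detection is actually enabled for the site.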
    

    This bot detection is known to use heuristics to distinguish bots from normal browsers. When it detects a bot, the result might be a 403 (access denied) status code or the site simply hanging; see "Scraping attempts getting 403 error" or "Requests SSL connection timeout".

    In this specific case it currently seems to help to add some specific HTTP headers: Accept-Encoding, Accept-Language, Connection with a value of keep-alive, and a User-Agent that somehow matches Mozilla. Omitting these headers or sending the wrong values results in a hang.

    The following currently works for me:

    $ curl -q -v \
       -H "Connection: keep-alive" \
       -H "Accept-Encoding: identity" \
       -H "Accept-Language: en-US" \
       -H "User-Agent: Mozilla/5.0"  \
       https://www.saiglobal.com/
    

    Note that this deliberately tries to bypass the bot detection. It might stop working if Akamai makes changes to the bot detection.

    Please note also that the owner of the site has explicitly enabled bot detection for a reason. This means that by deliberately bypassing the detection for your own gain (for example, to provide some service based on scraped information) you might get into legal problems.
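
Because a rejected request shows up as a hang rather than an error, it can help to bound the request with a timeout while experimenting with headers. The following is a rough sketch, not part of the original command: the function name fetch_with_timeout and the 15 second default are assumptions, while --max-time, -w '%{http_code}' and exit code 28 (operation timed out) are standard curl behaviour.

```shell
#!/bin/sh
# Hypothetical wrapper: sends the headers from the answer, prints only the
# HTTP status code, and reports a hang as a timeout instead of blocking.
fetch_with_timeout() {
    url=$1
    max_seconds=${2:-15}   # assumed default limit, adjust as needed
    rc=0
    curl -q -sS -o /dev/null \
         --max-time "$max_seconds" \
         -H "Connection: keep-alive" \
         -H "Accept-Encoding: identity" \
         -H "Accept-Language: en-US" \
         -H "User-Agent: Mozilla/5.0" \
         -w '%{http_code}\n' \
         "$url" || rc=$?
    # curl exits with 28 when --max-time is exceeded
    if [ "$rc" -eq 28 ]; then
        echo "timed out after ${max_seconds}s (possibly blocked)" >&2
    fi
    return "$rc"
}
```

Calling fetch_with_timeout https://www.saiglobal.com/ 10 would then either print a status code or report a timeout, which makes it easier to tell a slow response from a request that was silently dropped.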
