python requests & beautifulsoup bot detection

一整个雨季 2021-01-22 15:25

I'm trying to scrape all the HTML elements of a page using requests & beautifulsoup. I'm using ASIN (Amazon Standard Identification Number) to get the product details of a

3 answers
  • 2021-01-22 15:36

    Try this:

    import requests
    from bs4 import BeautifulSoup
    
    url = "http://www.amazon.com/dp/" + 'B004CNH98C'
    response = requests.get(url)
    html = response.text  # decoded response body as a str
    
    # Option 1: print the raw HTML
    # print(html)
    
    # Option 2: parse the HTML and print the parsed document
    soup = BeautifulSoup(html, "html.parser")
    print(soup)
    
  • 2021-01-22 15:39

    As some of the comments already suggested, if you need to interact with JavaScript on a page, it is better to use Selenium. However, regarding your first approach using a header:

    import requests
    from bs4 import BeautifulSoup
    
    url = "http://www.amazon.com/dp/" + 'B004CNH98C'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text,"html.parser")
    

    These headers are a bit old, but should still work. By using them you are pretending that your request comes from a normal web browser. If you use requests without such a header, your code is effectively telling the server that the request comes from Python (the default User-Agent is python-requests/<version>), which many servers reject right away.
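    To see the difference without sending anything over the network, the request can be prepared and inspected locally. This is only a sketch: it assumes the same URL and header as above and relies on requests' Session.prepare_request to show which User-Agent would be sent in each case.

```python
import requests

# Same product URL and browser-like header as in the answer above.
url = "http://www.amazon.com/dp/" + "B004CNH98C"
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/39.0.2171.95 Safari/537.36"
}

session = requests.Session()

# Prepare (but do not send) a request with no custom headers.
default_request = session.prepare_request(requests.Request("GET", url))

# Prepare the same request with the browser-like User-Agent.
browser_request = session.prepare_request(
    requests.Request("GET", url, headers=browser_headers)
)

# The first line identifies the client as python-requests/<version>,
# the second one looks like a regular browser.
print(default_request.headers["User-Agent"])
print(browser_request.headers["User-Agent"])
```

    Whether a browser-like User-Agent is enough depends on the site; this only shows what the server gets to see.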

    Another alternative is fake-useragent; you could give that a try as well.

  • 2021-01-22 15:43

    It is easier to use fake_useragent here. Its ua.random attribute returns a random User-Agent string chosen according to real-world browser usage statistics. If you don't need dynamic content, you're almost always better off just requesting the page content over HTTP and parsing it programmatically.

    import requests
    from fake_useragent import UserAgent
    
    ua = UserAgent()
    # Browser-like headers with a randomly chosen User-Agent
    hdr = {'User-Agent': ua.random,
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    url = "http://www.amazon.com/dp/" + 'B004CNH98C'
    response = requests.get(url, headers=hdr)
    print(response.content)  # raw response body as bytes
    

    Selenium is used for browser automation and for higher-level scraping of dynamic content.
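    If installing fake_useragent is not an option, the random User-Agent idea can be approximated with the standard library. This is a rough sketch, not what fake_useragent does internally: USER_AGENT_POOL and build_headers are hypothetical names, and the pool is a small hand-picked list rather than one derived from usage statistics.

```python
import random

# Hypothetical, hand-maintained pool of User-Agent strings (illustrative only).
USER_AGENT_POOL = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
]


def build_headers():
    """Return browser-like headers with a randomly picked User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENT_POOL),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.8",
        "Connection": "keep-alive",
    }


headers = build_headers()
print(headers["User-Agent"])
```

    The resulting dictionary can be passed as requests.get(url, headers=build_headers()), so each call may present a different User-Agent.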
