Python requests isn't giving me the same HTML as my browser is

失恋的感觉 2021-01-31 18:56

I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser is with the very

6 answers
  • 2021-01-31 19:18

    I had a similar issue:

    • Identical headers with Python and through the browser
    • JavaScript definitely ruled out as a cause

    To resolve the issue, I ended up swapping out the requests library for urllib.request.

    Basically, I replaced:

    import requests
    
    session = requests.Session()
    r = session.get(URL)
    

    with:

    import urllib.request
    
    r = urllib.request.urlopen(URL)
    

    and then it worked.

    Maybe one of those libraries is doing something strange behind the scenes? Not sure if that's an option for you or not.
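
    One concrete difference between the two libraries is the set of headers each sends when none are supplied: requests identifies itself with a User-Agent of the form python-requests/<version>, while urllib uses Python-urllib/<version>. The following is a minimal sketch, using only the standard library and sending nothing over the network, of how to inspect what urllib would send. The URL and the browser string are placeholders, not values from the question.

```python
import urllib.request

# The default opener holds the headers urllib adds to every request it sends.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)

# Build (but do not send) a request with an explicit User-Agent.
# Both the URL and the header value are placeholders for illustration.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (hypothetical browser string)"},
)

# urllib stores header names capitalized as "User-agent".
print(req.header_items())
```

    If a server varies its response by User-Agent, this default value is one plausible reason the two libraries could receive different HTML for the same URL.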

  • 2021-01-31 19:19

    Maybe Requests and browsers handle the raw data from the web server differently, and the differences in the example above exist only in the rendered HTML.

    I have found that when HTML is broken, different browsers (for example Chrome and Safari) repair it in different ways while parsing. Something similar may explain the difference between Requests and Firefox.

    I suggest diffing the raw data from both Requests and Firefox, i.e. the byte stream read from the socket. In Requests, the raw data is available through the .raw attribute of the response object, provided the request was made with stream=True (http://docs.python-requests.org/en/master/user/quickstart/). If the raw data is the same on both sides and the HTML contains broken markup, the difference probably comes from the browser repairing the broken HTML while parsing it; Requests returns the body as received and does not parse it.
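
    A minimal sketch of such a comparison, assuming the two bodies have already been saved as bytes (for example the response content on the Python side and a "view source" copy from the browser). The function name diff_raw and the sample payloads are made up for illustration.

```python
import difflib


def diff_raw(body_a: bytes, body_b: bytes, encoding: str = "utf-8") -> list:
    """Return unified-diff lines between two raw response bodies."""
    lines_a = body_a.decode(encoding, errors="replace").splitlines()
    lines_b = body_b.decode(encoding, errors="replace").splitlines()
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile="requests", tofile="browser", lineterm=""
        )
    )


# Stand-in payloads; in practice these would be the two saved bodies.
from_requests = b"<html>\n<p>Hello</p>\n</html>\n"
from_browser = b"<html>\n<p>Hello</p>\n<p>Extra</p>\n</html>\n"

for line in diff_raw(from_requests, from_browser):
    print(line)
```

    An empty result means the bytes agree line by line, which would point at parsing or rendering rather than at the server's response.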

  • 2021-01-31 19:23

    (Maybe my recent experience will help)

    I faced the same issue scraping Amazon: my local machine was able to process all the pages but, when I moved the project to a Google Cloud instance, the behavior changed for some of the items I was scraping.

    Previous implementation

    On my local machine I was using the requests library as follows:

    page = requests.get(url_page, headers=self.headers)
    page = page.content
    

    with the headers specified in my class, based on my local browser:

    headers = {
        "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36 OPR/67.0.3575.137"}
    

    but I got incomplete pages with this setup on the Google Cloud instance.

    New implementation

    The following implementation uses urllib without the headers:

    import urllib.request

    # No custom headers are passed, so urllib sends its own defaults
    req = urllib.request.Request(url_page, data=None)
    f = urllib.request.urlopen(req)
    page = f.read().decode('utf-8')
    self.page = page
    

    This solution works on both machines. Before this attempt I also tried urllib with the same headers, and the problem was not solved, so I removed the headers on the assumption that the problem lay there (maybe because I was incorrectly identifying myself as another client).

    So my code works perfectly, and I'm still able to process the content of the pages with BeautifulSoup, as in the following method, which I implemented in my class to extract the text from a specific portion of the page.

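    # Requires "from bs4 import BeautifulSoup" at module level
    def find_data(self, div_id):
        # Parse the stored page and take the text of the first element with this id
        soup = BeautifulSoup(self.page, features="lxml")
        text = soup.select("#" + div_id)[0].get_text()

        # Trim surrounding whitespace and drop double quotes
        text = text.strip()
        text = str(text)
        text = text.replace('"', "")
        return text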
    
  • 2021-01-31 19:30

    I suspect that you're not sending the proper headers (or are sending them incorrectly) with your request, which is why you are getting different content. Here is an example of an HTTP request with a header:

    import urllib.request

    url = 'https://www.google.co.il/search?q=eminem+twitter'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
    
    # header variable
    headers = {'User-Agent': user_agent}
    
    # creating request (urllib.request is the Python 3 replacement for urllib2)
    req = urllib.request.Request(url, None, headers)
    
    # getting html
    html = urllib.request.urlopen(req).read()
    

    If you are sure that you are sending the right headers but are still getting different HTML, you can try Selenium. It lets you work with a browser directly (or with PhantomJS if your machine doesn't have a GUI), so you can grab the HTML straight from the browser.

  • 2021-01-31 19:32

    I faced a similar issue while requesting a page. Then I noticed that the URL I was using required 'http' as its scheme, but I was prepending 'https'. My request URL looked like https://example.com, so I changed it to http://example.com. I hope this solves the problem.
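
    To try both schemes without editing the URL by hand, a small helper along these lines may be enough. It is only a sketch: the name with_scheme is hypothetical, and example.com stands in for the real site.

```python
from urllib.parse import urlsplit, urlunsplit


def with_scheme(url: str, scheme: str) -> str:
    """Return the same URL with its scheme replaced."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(scheme=scheme))


# Build both variants of a placeholder URL so each can be requested in turn.
candidates = [with_scheme("https://example.com", s) for s in ("http", "https")]
print(candidates)
```

    Each candidate can then be fetched with whichever HTTP client is in use and the responses compared.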

  • 2021-01-31 19:36

    A lot of the differences I see show that the content is still there; it is just rendered in a different order, sometimes with different spacing.

    You could be receiving different content for several reasons:

    • Your headers
    • Your user agent
    • The time!
    • The order in which the web application decides to render elements on the page, subject to random attribute order, as an element may be pulled from an unsorted data source.

    If you could include all of your headers at the top of that diff, we may be able to make more sense of it.

    I suspect that the application chose not to render certain images because they aren't optimized for what it thinks is some kind of robot or mobile device (Python Requests).

    On a closer look at the diff, it appears that everything was loaded in both requests, just with a different formatting.
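
    One way to check whether two responses differ only in formatting is to compare them after normalizing attribute order and whitespace. The following is a rough sketch using only the standard library; the class and function names are made up for illustration, and it does not account for elements appearing in a different order.

```python
from html.parser import HTMLParser


class Normalizer(HTMLParser):
    """Collect tags and text, ignoring attribute order and extra whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes so their order in the source does not matter.
        ordered = tuple(sorted((name, value or "") for name, value in attrs))
        self.tokens.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace and skip whitespace-only text.
        text = " ".join(data.split())
        if text:
            self.tokens.append(("text", text))


def normalize(html: str) -> list:
    parser = Normalizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Stand-in documents that differ only in attribute order and spacing.
page_a = '<div class="x" id="a">\n  <p>Hello   world</p>\n</div>'
page_b = '<div id="a" class="x"><p>Hello world</p></div>'
print(normalize(page_a) == normalize(page_b))
```

    If the normalized token lists match, the two responses carry the same content and only the formatting differs.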
