Python requests isn't giving me the same HTML as my browser is


I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser is with the very


6 answers

后悔当初

2021-01-31 19:18
I had a similar issue:
- Identical headers with Python and through the browser
- JavaScript definitely ruled out as a cause
To resolve the issue, I ended up swapping out the requests library for urllib.request.

Basically, I replaced:
```
import requests

session = requests.Session()
r = session.get(URL)
```
with:
```
import urllib.request

r = urllib.request.urlopen(URL)
```
and then it worked.

Maybe one of those libraries is doing something strange behind the scenes? Not sure if that's an option for you or not.
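
One thing the two libraries certainly do differently is the set of headers they send when you don't specify any, and a server may vary its response on that. The sketch below just prints those defaults side by side. The helper names are my own, and the requests part is skipped if the library isn't installed.

```python
import urllib.request


def urllib_default_headers():
    """Headers that a default urllib.request opener attaches to every request."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)


def requests_default_headers():
    """Default headers of a requests.Session, or None if requests is not installed."""
    try:
        import requests
    except ImportError:
        return None
    return dict(requests.Session().headers)


print("urllib:  ", urllib_default_headers())
print("requests:", requests_default_headers())
```

If the output differs in User-Agent or Accept-Encoding, that alone may be enough for some servers to return different markup.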
孤街浪徒

2021-01-31 19:19

Maybe Requests and browsers process the raw data from the web server differently, and the differences in the example above exist only in the rendered HTML.

I found that when HTML is broken, different browsers, e.g. Chrome and Safari, repair it in different ways when parsing. So maybe the same idea applies to Requests and Firefox.

I suggest diffing the raw data from both Requests and Firefox, i.e. the byte stream on the socket. In Requests, the .raw attribute of the response object gives access to the raw socket response, provided the request was made with stream=True (http://docs.python-requests.org/en/master/user/quickstart/). If the raw data is the same on both sides and the HTML contains broken markup, the difference may be due to the different auto-fixing policies applied on each side when the broken HTML is parsed.
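
A minimal sketch of that comparison, assuming you have already saved the two bodies as bytes (for instance the Python response body and the browser's "view source" output). The function name and labels are hypothetical; only the standard library's difflib is used.

```python
import difflib


def diff_html(python_body: bytes, browser_body: bytes, encoding: str = "utf-8"):
    """Return a unified diff (list of lines) between two raw HTML bodies."""
    a = python_body.decode(encoding, errors="replace").splitlines()
    b = browser_body.decode(encoding, errors="replace").splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile="python", tofile="browser", lineterm="")
    )


# Tiny stand-in bodies; replace them with the bytes you actually captured
for line in diff_html(b"<p>a</p>\n<p>same</p>", b"<p>b</p>\n<p>same</p>"):
    print(line)
```

An empty result means the bytes decode to the same lines, so any remaining difference comes from what happens after the download.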

春和景丽

2021-01-31 19:23
(Maybe my recent experience will help)

I faced the same issue scraping Amazon: my local machine was able to process all the pages, but when I moved the project to a Google Cloud instance, the behavior changed for some of the items I was scraping.

Previous implementation

On my local machine I was using the requests library as follows:
```
page = requests.get(url_page, headers=self.headers)
page = page.content  # response body as bytes
```
with the headers specified in my class, based on my local browser:
```
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36 OPR/67.0.3575.137"}
```
but I got incomplete pages using this setup on the Google Cloud instance.

New implementation

The following implementation uses urllib without the headers:
```
req = urllib.request.Request(url_page, data=None)
f = urllib.request.urlopen(req)
page = f.read().decode('utf-8')  # assumes the page is UTF-8 encoded
self.page = page
```
This solution works on both machines. Before this attempt, I also tried using the same headers and the problem was not solved, so I removed the headers, supposing that the problem was there (maybe because I was incorrectly identifying myself as another client).
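
For anyone who wants to try both variants (with and without headers) from one helper, here is a rough sketch. It also reads the charset from the Content-Type header rather than always assuming UTF-8. The function name, the timeout, and the fallback encoding are my own assumptions, not part of the setup described above; the demo uses a data: URL so it runs without network access.

```python
import urllib.request


def fetch_html(url, headers=None, timeout=10.0, fallback_encoding="utf-8"):
    """Fetch url with urllib and decode the body using the declared charset."""
    req = urllib.request.Request(url, data=None, headers=headers or {})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = resp.read()
        # Charset declared in the Content-Type header, if any
        charset = resp.headers.get_content_charset() or fallback_encoding
    return body.decode(charset, errors="replace")


# Offline demo: a data: URL stands in for a real page
demo_url = "data:text/html;charset=utf-8,%3Cp%3Ehi%3C%2Fp%3E"
print(fetch_html(demo_url))
```

Calling it once with headers=None and once with your browser's headers on the same real URL makes it easy to see whether the headers are what changes the response.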

So, my code works perfectly and I'm still able to process the content of the pages with BeautifulSoup, as in the following method, which I implemented in my class to extract the text from a specific portion of the page.
```
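def find_data(self, div_id):
    # Requires: from bs4 import BeautifulSoup (with the lxml parser installed)
    soup = BeautifulSoup(self.page, features="lxml")
    # Take the first element whose id matches div_id
    text = soup.select("#" + div_id)[0].get_text()

    text = text.strip()
    text = text.replace('"', "")
    return text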
```
再見小時候

2021-01-31 19:30
I suspect that you're not sending the proper header (or are sending it incorrectly) with your request. That's why you are getting different content. Here is an example of an HTTP request with a header:
```
import urllib2  # Python 2 module; in Python 3 the equivalent is urllib.request

url = 'https://www.google.co.il/search?q=eminem+twitter'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'

# header variable
headers = { 'User-Agent' : user_agent }

# creating request
req = urllib2.Request(url, None, headers)

# getting html
html = urllib2.urlopen(req).read()
```
If you are sure that you are sending the right header but are still getting different HTML, you can try Selenium. It allows you to work with a browser directly (or with PhantomJS if your machine doesn't have a GUI). With Selenium you can grab the HTML directly from the browser.
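
To check what is actually attached to a request before sending it, something like the following Python 3 sketch may help. The User-Agent is the one from the example above; the Accept and Accept-Language values and the build_request helper are assumptions of mine, so copy the real values from your browser's network tab. Nothing is sent over the network here.

```python
import urllib.request

# User-Agent taken from the example above; the other two values are assumed
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, extra_headers=None):
    """Build (but do not send) a request carrying browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, data=None, headers=headers)


req = build_request("https://www.google.co.il/search?q=eminem+twitter")
# urllib stores header names capitalized, e.g. "User-agent"
for name, value in req.header_items():
    print(name, ":", value)
```

Comparing this list with the request headers shown in the browser's developer tools is a quick way to spot a missing or misspelled header.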
陌清茗

2021-01-31 19:32

I was facing a similar issue while requesting a page. Then I noticed that the URL I was using required 'http' to be prepended, but I was prepending 'https'. My request URL looked like https://example.com. So make the URL look like http://example.com instead. Hope it solves the problem.
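
If you want to test whether the scheme matters for your page, a small helper like this (the name is hypothetical) produces both variants of a URL so that each can be fetched and compared:

```python
from urllib.parse import urlsplit


def with_scheme(url, scheme):
    """Return url with its scheme replaced, leaving host, path and query untouched."""
    return urlsplit(url)._replace(scheme=scheme).geturl()


candidates = [with_scheme("https://example.com", s) for s in ("http", "https")]
print(candidates)
```

Fetching each candidate and checking the final URL of the response also shows whether the server redirects one scheme to the other.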

野性不改

2021-01-31 19:36
A lot of the differences I see show that the content is still there; it's just rendered in a different order, sometimes with different spacing.

You could be receiving different content based on multiple different things:
- Your headers
- Your user agent
- The time!
- The order in which the web application decides to render elements on the page, subject to random attribute order, as the element may be pulled from an unsorted data source.
If you could include all of your headers at the top of that diff, then we may be able to make more sense of it.

I suspect that the application chose not to render certain images as they aren't optimized for what it thinks is some kind of robot/mobile device (Python Requests).

On a closer look at the diff, it appears that everything was loaded in both requests, just with a different formatting.
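
To confirm that two responses differ only in formatting, one option is to compare them after normalizing whitespace and attribute order. This is a rough sketch using the standard library's html.parser; the class and function names are mine. It ignores spacing and attribute order but not a different order of elements.

```python
from html.parser import HTMLParser


class _Tokenizer(HTMLParser):
    """Collect tags and text as tokens, ignoring spacing and attribute order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes so that their order in the source does not matter
        self.tokens.append(("start", tag, tuple(sorted((k, v or "") for k, v in attrs))))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only text nodes
        text = " ".join(data.split())
        if text:
            self.tokens.append(("text", text))


def normalized_tokens(html):
    """Return a formatting-insensitive token list for an HTML string."""
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


a = '<a href="x" class="y">hi</a>'
b = '<a class="y"   href="x">\n  hi\n</a>'
print(normalized_tokens(a) == normalized_tokens(b))
```

If the token lists are equal, the two pages carry the same markup and text, and the diff is only cosmetic.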