I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser is with the very same page.
(Maybe my recent experience will help)
I faced the same issue scraping Amazon: my local machine was able to process all the pages, but when I moved the project to a Google Cloud instance, the behavior changed for some of the items I was scraping.
On my local machine I was using the requests library as follows:
import requests

page = requests.get(url_page, headers=self.headers)
page = page.content
with the headers specified in my class, based on my local browser:
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36 OPR/67.0.3575.137"
}
but I got incomplete pages with this setup on the Google Cloud instance.
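If you hit the same symptom, a quick way to tell a full page from a truncated one is to check for an element you know must be present on a complete page (a minimal sketch; expected_id is a hypothetical id of such an element):

from bs4 import BeautifulSoup

def looks_complete(html, expected_id):
    # A complete page should contain the element we plan to scrape;
    # a truncated or blocked response usually will not.
    soup = BeautifulSoup(html, features="lxml")
    return len(soup.select("#" + expected_id)) > 0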
The following implementation uses urllib without headers:
import urllib.request

req = urllib.request.Request(
    url_page,
    data=None
)
f = urllib.request.urlopen(req)
page = f.read().decode('utf-8')
self.page = page
This solution works on both machines. Before this attempt I had also tried urllib with the same headers, and the problem was not solved, so I removed the headers, supposing that the problem was there (maybe because I was incorrectly identifying myself as another client).
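For context, without explicit headers urllib identifies itself with its own default User-agent, which you can inspect on the opener (a small sketch, just to show what actually gets sent):

import urllib.request

# The opener holds the headers urllib adds by default,
# typically [('User-agent', 'Python-urllib/3.x')]
opener = urllib.request.build_opener()
print(opener.addheaders)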
So my code works perfectly, and I'm still able to process the content of the pages with BeautifulSoup, as in the following method, which I implemented in my class to extract the text from a specific portion of the page:
from bs4 import BeautifulSoup  # at module level

def find_data(self, div_id):
    soup = BeautifulSoup(self.page, features="lxml")
    # Take the first element matching the given id and clean its text
    text = soup.select("#" + div_id)[0].get_text()
    text = text.strip()
    text = text.replace('"', "")
    return text
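One caveat: soup.select returns an empty list when the element is missing, which is exactly what happens on an incomplete page, so indexing with [0] raises an IndexError. A slightly more defensive variant (a sketch; it returns None when the element is absent):

def find_data(self, div_id):
    soup = BeautifulSoup(self.page, features="lxml")
    matches = soup.select("#" + div_id)
    if not matches:
        # Element not found: likely an incomplete page
        return None
    return matches[0].get_text().strip().replace('"', "")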