Python requests isn't giving me the same HTML as my browser is

失恋的感觉 2021-01-31 18:56

I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser is with the very

6 answers
  •  春和景丽
    2021-01-31 19:23

    (Maybe my recent experience will help)

    I faced the same issue scraping Amazon: my local machine was able to process all the pages, but when I moved the project to a Google Cloud instance, the behavior changed for some of the items I was scraping.

    Previous implementation

    On my local machine I was using the requests library as follows:

    page = requests.get(url_page, headers=self.headers)
    page = page.content  # raw response body as bytes
    

    with the headers specified in my class, based on my local browser:

    headers = {
        "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36 OPR/67.0.3575.137"}
    

    but I got incomplete pages with this setup on the Google Cloud instance.

    New implementation

    The following implementation uses urllib without any custom headers:

    import urllib.request

    # Runs inside a method of my class; url_page is the URL to fetch.
    req = urllib.request.Request(url_page, data=None)
    f = urllib.request.urlopen(req)
    page = f.read().decode('utf-8')
    self.page = page
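
    The snippet above decodes every response as UTF-8. As a hedged sketch (not part of the original answer), the charset can instead be read from the Content-Type header when the server declares one. The helper name decode_body and its parameters are hypothetical; with a live response, the header value would come from f.headers.get("Content-Type").

```python
from email.message import Message


def decode_body(raw, content_type=None, fallback="utf-8"):
    """Decode response bytes using the charset declared in Content-Type.

    decode_body is a hypothetical helper for illustration only.
    Falls back to `fallback` when no charset is declared.
    """
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or fallback
    # An unknown charset name would raise LookupError; errors="replace"
    # only covers bytes that are invalid in a known charset.
    return raw.decode(charset, errors="replace")


latin1_bytes = "café".encode("latin-1")
decoded = decode_body(latin1_bytes, "text/html; charset=ISO-8859-1")
```

    Here decoded keeps the accented character, whereas decoding the same bytes strictly as UTF-8 would fail.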
    

    The urllib version works on both machines. Before this attempt, I also tried using the same headers and the problem was not solved, so I removed the headers, supposing that the problem was there (maybe because I was incorrectly identifying myself as another client).
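
    To make the effect of dropping the headers concrete, here is a minimal sketch that reports which User-Agent urllib would send, without touching the network. The function name effective_user_agent and the example URL are assumptions made for illustration. When no header is given, urllib's opener fills in its own default of the form Python-urllib/3.x, so the server no longer sees a browser identity.

```python
import urllib.request


def effective_user_agent(url, headers=None):
    """Return the User-Agent urllib would send for this request.

    effective_user_agent is a hypothetical helper; it performs no network I/O.
    """
    req = urllib.request.Request(url, data=None, headers=headers or {})
    # Request stores header names via str.capitalize(), hence "User-agent".
    explicit = req.get_header("User-agent")
    if explicit is not None:
        return explicit
    # With no explicit header, the opener's default headers apply at open time.
    opener = urllib.request.build_opener()
    return dict(opener.addheaders).get("User-agent")


browser_like = effective_user_agent(
    "https://example.com",
    {"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
default_agent = effective_user_agent("https://example.com")
```

    Comparing browser_like and default_agent shows the two identities a site may treat differently, which is one possible reason the responses diverge.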

    So my code works perfectly, and I'm still able to process the content of the pages with BeautifulSoup, as in the following method, which I implemented in my class to extract the text from a specific portion of the page.

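    def find_data(self, div_id):
        # Parse the stored page and select the element with the given id.
        soup = BeautifulSoup(self.page, features="lxml")
        text = soup.select("#" + div_id)[0].get_text()

        # Trim surrounding whitespace and drop double quotes.
        text = text.strip()
        text = text.replace('"', "")
        return text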
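
    Because an incomplete page may simply lack the element, indexing the first match with [0] raises IndexError in that case. A minimal standalone sketch of a more defensive variant follows. The function name find_text_by_id, the default argument, and the sample HTML are assumptions for illustration, and it uses the built-in "html.parser" backend so that lxml is not required.

```python
from bs4 import BeautifulSoup


def find_text_by_id(html, element_id, default=""):
    """Return the cleaned text of the element with the given id.

    find_text_by_id is a hypothetical helper; it returns `default`
    when the element is missing instead of raising IndexError.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(id=element_id)
    if node is None:
        return default
    # Same cleanup as the original method: trim and drop double quotes.
    return node.get_text().strip().replace('"', "")


sample = '<div id="title"> "Example" product </div>'
found = find_text_by_id(sample, "title")
missing = find_text_by_id(sample, "price", default="n/a")
```

    A missing element then shows up as the default value, which makes it easier to tell an incomplete download from a parsing bug.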
    
