I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser is with the very same page.
(Maybe my recent experience will help)
I faced the same issue scraping Amazon: my local machine was able to process all the pages, but when I moved the project to a Google Cloud instance, the behavior changed for some of the items I was scraping.
On my local machine I was using the requests library as follows:
import requests

page = requests.get(url_page, headers=self.headers)
page = page.content
with the headers specified in my class, based on my local browser:
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36 OPR/67.0.3575.137"
}
but I got incomplete pages with this setup on the Google Cloud instance.
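If you hit the same symptom, a quick way to tell a full page from a truncated one is to check for an element you know must be present on a complete page (a minimal sketch; expected_id is a hypothetical id of such an element):

from bs4 import BeautifulSoup

def looks_complete(html, expected_id):
    # A complete page should contain the element we plan to scrape;
    # a truncated or blocked response usually will not.
    soup = BeautifulSoup(html, features="lxml")
    return len(soup.select("#" + expected_id)) > 0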
The following implementation uses urllib without headers:
import urllib.request

req = urllib.request.Request(
    url_page,
    data=None
)
f = urllib.request.urlopen(req)
page = f.read().decode('utf-8')
self.page = page
This solution works on both machines. Before this attempt I had also tried urllib with the same headers, and the problem was not solved, so I removed the headers, supposing that the problem was there (maybe because I was incorrectly identifying myself as another client).
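For context, without explicit headers urllib identifies itself with its own default User-agent, which you can inspect on the opener (a small sketch, just to show what actually gets sent):

import urllib.request

# The opener holds the headers urllib adds by default,
# typically [('User-agent', 'Python-urllib/3.x')]
opener = urllib.request.build_opener()
print(opener.addheaders)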
So my code works perfectly, and I'm still able to process the content of the pages with BeautifulSoup, as in the following method, which I implemented in my class to extract the text from a specific portion of the page:
from bs4 import BeautifulSoup  # at module level

def find_data(self, div_id):
    soup = BeautifulSoup(self.page, features="lxml")
    # Take the first element matching the given id and clean its text
    text = soup.select("#" + div_id)[0].get_text()
    text = text.strip()
    text = text.replace('"', "")
    return text
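One caveat: soup.select returns an empty list when the element is missing, which is exactly what happens on an incomplete page, so indexing with [0] raises an IndexError. A slightly more defensive variant (a sketch; it returns None when the element is absent):

def find_data(self, div_id):
    soup = BeautifulSoup(self.page, features="lxml")
    matches = soup.select("#" + div_id)
    if not matches:
        # Element not found: likely an incomplete page
        return None
    return matches[0].get_text().strip().replace('"', "")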