python urllib2 - wait for page to finish loading/redirecting before scraping?

笑着哭i 提交于 2019-12-28 17:42:32

问题


I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the html using urllib2. However, I'm running into a problem where, using the code below, the html I get back is not correct as the page seems to take a second to redirect (you can verify this by visiting the url) - instead I get the code from the page that initially briefly appears.

Is there some behavior or parameter to set to make sure the page has completely finished loading/redirecting before getting the website content?

import urllib2
from bs4 import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
print soup.prettify()

Edit: The answer is thorough, however, in the end what solved my problem was this: https://stackoverflow.com/a/3210737/1157283


回答1:


Inreresting the problem isn't a redirect is that page modifies the content using javascript, but urllib2 doesn't have a JS engine it just GETS data, if you disabled javascript on your browser you will note it loads basically the same content as what urllib2 returns

import urllib2
from BeautifulSoup import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
open('test.html', 'w').write(soup.read())

test.html and disabling JS in your browser, easiest in firefox content -> uncheck enable javascript, generates identical result sets.

So what can we do well, first we should check if the site offers an API, scrapping tends to be frown up http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available

Travel/Hotel API's? it looks they might, though with some restrictions.

But if we still need to scrape it, with JS, then we can use selenium http://seleniumhq.org/ its mainly used for testing, but its easy and has fairly good docs.

I also found this Scraping websites with Javascript enabled? and this http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

hope that helps.

As a side note:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> 
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)


来源:https://stackoverflow.com/questions/11460105/python-urllib2-wait-for-page-to-finish-loading-redirecting-before-scraping

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!