Question
I'm trying to load a web page, scroll to the very bottom of it (the page uses infinite scroll), and then get the page source.
Scrolling and loading seem to work correctly, but driver.page_source returns very short HTML, which is just a small part of the whole page source.
import time
from selenium import webdriver

driver = webdriver.Chrome()  # the question uses chromedriver

def scroll_to_the_bottom(driver):
    # Keep scrolling until page_source stops changing between rounds.
    old_html = ''
    new_html = driver.page_source
    while old_html != new_html:
        print('SCROLL')
        old_html = driver.page_source
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # give the next chunk time to load
        new_html = driver.page_source

driver.get("http://www.citypaper.com/music/short-list/bcpnews-the-short-list-weird-al-the-heartless-bastards-chastity-belt-more-20150609-story.html")
scroll_to_the_bottom(driver)
print(driver.page_source)
CONSOLE:
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" data-role="base navhead resizescroll imgsize metrics oopadloader socialshare panelmod transporter"><head><script type="text/javascript" async="" src="//ml314.com/tag.aspx?2972015"></script><script type="text/javascript" async="" src="//ml314.com/tag.aspx?2972015"></script><script async="" src="http://b.scorecardresearch.com/beacon.js"></script><script async="" src="//www.google-analytics.com/analytics.js"></script><script type="text/javascript" src="http://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script charset="UTF-8" type="text/javascript" src="http://cdn.taboola.com/libtrc/impl.174-RELEASE.js"></script><script async="" src="//widget.perfectmarket.com/tribunedigital-network/load.js"></script><script async="" src="http://b.scorecardresearch.com/beacon.js"></script>
<title>Music Boxes - Baltimore City Paper</title>
<link rel="dns-prefetch" href="//www.trbimg.com" /><link rel="dns-prefetch" href="//static.chartbeat.com" /><link rel="dns-prefetch" href="//loggingservices.tribune.com" /><link rel="dns-prefetch" href="//m.trb.com" /><link rel="dns-prefetch" href="//b.scorecardresearch.com" /><link rel="dns-prefetch" href="//www.google-analytics.com" /><link rel="dns-prefetch" href="http://pubads.g.doubleclick.net" /><link rel="dns-prefetch" href="https://securepubads.g.doubleclick.net" /><link rel="dns-prefetch" href="//secure-us.imrworldwide.com" /><link rel="dns-prefetch" href="//www.googletagservices.com" /><link rel="dns-prefetch" href="http://ssor.tribdss.com" /><link rel="dns-prefetch" href="//cdn.krxd.net" /><link rel="dns-prefetch" href="//cdn.gigya.com" /><link rel="dns-prefetch" href="//cdn.taboola.com" /><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" />
<meta charset="utf-8" />
<meta name="x-servername" content="i10latisrapp02" />
<meta name="robots" content="noodp, noydir" />
I use chromedriver, so I can clearly see that it scrolls to the bottom. Where could the problem be?
EDIT: I've added the URL of the page being scraped.
Answer 1:
You cannot rely on page_source to get the current state of the page. The Python docs do not point this out, but if you look at the Selenium Java docs for getPageSource you'll see:
If the page has been modified after loading (for example, by Javascript) there is no guarantee that the returned text is that of the modified page.
What you can do is ask the browser to serialize the DOM. This will produce HTML that represents the DOM at the time you make the call:
driver.execute_script("return document.documentElement.outerHTML")
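A minimal sketch of how that call could replace page_source in the question's scroll loop. The helper names dom_snapshot and scroll_to_the_bottom, and the pause and max_rounds parameters, are illustrative. Comparing document.body.scrollHeight between rounds, rather than the full HTML, is an assumption about how to detect that nothing new was loaded on this page.

```python
import time


def dom_snapshot(driver):
    """Ask the browser to serialize the live DOM and return it as HTML."""
    return driver.execute_script("return document.documentElement.outerHTML")


def scroll_to_the_bottom(driver, pause=3, max_rounds=50):
    """Scroll until the page height stops growing, then return the DOM HTML.

    Assumption: a stable document.body.scrollHeight means no more content
    is being appended. max_rounds is a safety cap, not a Selenium setting.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the next chunk to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height did not change, so assume we reached the end
        last_height = new_height
    return dom_snapshot(driver)
```

With a real driver this would be called as html = scroll_to_the_bottom(driver) right after driver.get(...). The snapshot only contains what is in the DOM at that moment.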
Answer 2:
Are you aware that the page content loads/unloads as you scroll down? The page is unloading previous sections as you scroll down. For instance, scroll all the way down to the bottom of the page and start scrolling back up. You will see that it's loading previous sections.
To prove this... when you first load the page, the first article title is, "The Short List: Weird Al, the Heartless Bastards, Chastity Belt, more". Scroll to the bottom of the page, pull the HTML source (manually), and search for that title. It's not there.
I don't know what you are trying to do, but if all you want is the last section, you can navigate to it directly using the URL http://www.citypaper.com/music/music-boxes/
The different sections are:
Main article
http://www.citypaper.com/music/music-features/
http://www.citypaper.com/music/listening-party/
http://www.citypaper.com/music/music-boxes/
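If each section is enough on its own, one way to sketch this is to load the section URLs listed above one at a time and snapshot each. fetch_sections is a hypothetical helper. It takes the driver as a parameter and serializes the DOM through execute_script rather than reading page_source.

```python
SECTION_URLS = [
    "http://www.citypaper.com/music/music-features/",
    "http://www.citypaper.com/music/listening-party/",
    "http://www.citypaper.com/music/music-boxes/",
]


def fetch_sections(driver, urls=SECTION_URLS):
    """Load each section URL directly and return {url: serialized DOM HTML}."""
    pages = {}
    for url in urls:
        driver.get(url)
        # Serialize the DOM as it is right after navigation.
        pages[url] = driver.execute_script(
            "return document.documentElement.outerHTML"
        )
    return pages
```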
Why do you want the HTML source of the page anyway? What are you trying to accomplish? One of the main points of using Selenium is that you can find HTML elements using locators, so you don't have to parse the source yourself.
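As a sketch of the locator-based approach: since earlier sections are unloaded as you scroll, element text could be collected at every scroll step instead of once at the end. The CSS selector "h1" is a placeholder assumption (the page's real markup is not shown here), and collect_titles, pause and max_rounds are hypothetical names. The string "css selector" is the value of By.CSS_SELECTOR in Selenium's Python bindings, passed literally so the sketch needs no extra import.

```python
import time


def collect_titles(driver, selector="h1", pause=3, max_rounds=50):
    """Scroll step by step and accumulate the text of matching elements.

    Assumption: a round that yields no new text means the end was reached.
    """
    seen = []
    for _ in range(max_rounds):
        found_new = False
        # "css selector" is the value of By.CSS_SELECTOR
        for element in driver.find_elements("css selector", selector):
            text = element.text.strip()
            if text and text not in seen:
                seen.append(text)
                found_new = True
        if not found_new:
            break
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the next section to load
    return seen
```

Keeping the results in a list preserves the order in which titles first appeared, even after the page has dropped them from the DOM.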
Answer 3:
I had a similar problem. I added time.sleep(5) after driver.get(), before reading page_source, so that the content has time to load.
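A fixed sleep can be too short on a slow connection and wasteful on a fast one. One hedged alternative is a plain polling loop (not a Selenium API) that waits until the serialized DOM stops changing in length. wait_for_stable_source and its parameters are hypothetical names, and an unchanged length is only a heuristic for "finished loading".

```python
import time


def wait_for_stable_source(driver, interval=0.5, timeout=15, stable_polls=3):
    """Poll the serialized DOM until its length stops changing.

    Returns the last HTML seen, whether or not it stabilised before timeout.
    """
    script = "return document.documentElement.outerHTML"
    deadline = time.monotonic() + timeout
    html = driver.execute_script(script)
    unchanged = 0  # consecutive polls with the same length
    while unchanged < stable_polls and time.monotonic() < deadline:
        time.sleep(interval)
        latest = driver.execute_script(script)
        unchanged = unchanged + 1 if len(latest) == len(html) else 0
        html = latest
    return html
```

It would be called after driver.get(...), for example html = wait_for_stable_source(driver).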
Source: https://stackoverflow.com/questions/32285070/driver-doesnt-return-proper-page-source