Driver doesn't return proper page source

Question

I'm trying to load a web page, scroll to the very bottom (the page uses infinite scroll), and then get the page source.

Scrolling and loading seem to work correctly, but driver.page_source returns very short HTML that is only a small part of the whole page source.

import time

from selenium import webdriver


def scroll_to_the_bottom(driver):
    # Keep scrolling until the page source stops changing between passes.
    old_html = ''
    new_html = driver.page_source
    while old_html != new_html:
        print('SCROLL')
        old_html = driver.page_source
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # give the infinite scroll time to load more content
        new_html = driver.page_source


driver = webdriver.Chrome()  # chromedriver, as mentioned below
driver.get("http://www.citypaper.com/music/short-list/bcpnews-the-short-list-weird-al-the-heartless-bastards-chastity-belt-more-20150609-story.html")
scroll_to_the_bottom(driver)
print(driver.page_source)

CONSOLE:

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" data-role="base navhead resizescroll imgsize metrics oopadloader socialshare panelmod transporter"><head><script type="text/javascript" async="" src="//ml314.com/tag.aspx?2972015"></script><script type="text/javascript" async="" src="//ml314.com/tag.aspx?2972015"></script><script async="" src="http://b.scorecardresearch.com/beacon.js"></script><script async="" src="//www.google-analytics.com/analytics.js"></script><script type="text/javascript" src="http://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script charset="UTF-8" type="text/javascript" src="http://cdn.taboola.com/libtrc/impl.174-RELEASE.js"></script><script async="" src="//widget.perfectmarket.com/tribunedigital-network/load.js"></script><script async="" src="http://b.scorecardresearch.com/beacon.js"></script>
<title>Music Boxes - Baltimore City Paper</title>

      <link rel="dns-prefetch" href="//www.trbimg.com" /><link rel="dns-prefetch" href="//static.chartbeat.com" /><link rel="dns-prefetch" href="//loggingservices.tribune.com" /><link rel="dns-prefetch" href="//m.trb.com" /><link rel="dns-prefetch" href="//b.scorecardresearch.com" /><link rel="dns-prefetch" href="//www.google-analytics.com" /><link rel="dns-prefetch" href="http://pubads.g.doubleclick.net" /><link rel="dns-prefetch" href="https://securepubads.g.doubleclick.net" /><link rel="dns-prefetch" href="//secure-us.imrworldwide.com" /><link rel="dns-prefetch" href="//www.googletagservices.com" /><link rel="dns-prefetch" href="http://ssor.tribdss.com" /><link rel="dns-prefetch" href="//cdn.krxd.net" /><link rel="dns-prefetch" href="//cdn.gigya.com" /><link rel="dns-prefetch" href="//cdn.taboola.com" /><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" />
    <meta charset="utf-8" />
    <meta name="x-servername" content="i10latisrapp02" />

      <meta name="robots" content="noodp, noydir" />

I use chromedriver, so I can clearly see that it scrolls to the bottom. Where could the problem be?

EDIT:

I've added the URL of the page being scraped.

Answer 1:

You cannot rely on page_source to get the current state of the page. The Python docs do not point it out but if you look at the Java docs of Selenium for getPageSource you'll see:

If the page has been modified after loading (for example, by Javascript) there is no guarantee that the returned text is that of the modified page.

What you can do is ask the browser to serialize the DOM. This will produce HTML that represents the DOM at the time you make the call:

driver.execute_script("return document.documentElement.outerHTML")
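A minimal sketch of how the scroll loop from the question might be combined with this call. The function name scroll_and_serialize and the pause and max_rounds parameters are assumptions made for illustration, and the loop compares document.body.scrollHeight between passes instead of comparing the whole page source:

```python
import time


def scroll_and_serialize(driver, pause=3.0, max_rounds=50):
    """Scroll until the page height stops growing, then return the serialized DOM.

    `driver` is any object with a Selenium-style execute_script() method.
    `pause` and `max_rounds` are illustrative defaults, not tuned values.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next chunk
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume the bottom was reached
        last_height = new_height
    # Ask the browser for the DOM as it is right now, not as it was delivered.
    return driver.execute_script("return document.documentElement.outerHTML")
```

With a real browser this would be called as html = scroll_and_serialize(driver) after driver.get(...). Note that on a page that removes earlier sections while scrolling, the result still only contains whatever is in the DOM at that moment.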

Answer 2:

Are you aware that the page content loads/unloads as you scroll down? The page is unloading previous sections as you scroll down. For instance, scroll all the way down to the bottom of the page and start scrolling back up. You will see that it's loading previous sections.

To prove this... when you first load the page, the first article title is, "The Short List: Weird Al, the Heartless Bastards, Chastity Belt, more". Scroll to the bottom of the page, pull the HTML source (manually), and search for that title. It's not there.

I don't know what you are trying to do, but if all you want is the last section, you can navigate directly to it using the URL http://www.citypaper.com/music/music-boxes/

The different sections are:

Main article

http://www.citypaper.com/music/music-features/

http://www.citypaper.com/music/listening-party/

http://www.citypaper.com/music/music-boxes/

Why do you want the HTML source of the page anyway? What are you trying to accomplish? One of the main points of using Selenium is that you can find HTML elements using locators, so you don't have to parse the source yourself.
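Because sections are unloaded as you scroll, one option is to read elements through a locator on every pass and accumulate what you find. The following is only a sketch: the function name, the pause and max_rounds parameters, and whatever CSS selector you pass in are hypothetical and would need to match the real page markup. It assumes a Selenium-style driver with find_elements() and execute_script(), and it does not handle elements that go stale between being found and being read:

```python
import time

CSS_SELECTOR = "css selector"  # same string as selenium's By.CSS_SELECTOR


def collect_texts_while_scrolling(driver, selector, pause=3.0, max_rounds=50):
    """Collect the text of matching elements on each pass while scrolling down."""
    seen = []
    last_height = None
    for _ in range(max_rounds):
        # Read what is currently in the DOM before it can be unloaded.
        for element in driver.find_elements(CSS_SELECTOR, selector):
            text = element.text.strip()
            if text and text not in seen:
                seen.append(text)
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            break  # the page did not grow since the previous pass
        last_height = height
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
    return seen
```

The texts come back in the order they were first seen, so items from sections that were later removed from the DOM are still kept.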

Answer 3:

I had a similar problem. I used time.sleep(5) after get.page_source so that the contents can be read.
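A fixed sleep either wastes time or is too short. As an alternative, here is a hand-rolled polling helper; the name wait_until and its timeout and interval parameters are assumptions for illustration. Selenium also ships WebDriverWait in selenium.webdriver.support.ui for the same purpose:

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)
```

For example, wait_until(lambda: "Music Boxes" in driver.execute_script("return document.documentElement.outerHTML")) would block until that text appears in the serialized DOM or the timeout expires.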

Source: https://stackoverflow.com/questions/32285070/driver-doesnt-return-proper-page-source

Tags

python

html

selenium

selenium-webdriver

infinite-scroll