Scrolling infinite page with Python/PhantomJS/Selenium

Question

I'm trying to scrape an infinite-scroll page (www.mydealz.de), but I cannot get my webdriver to scroll down the page. I'm using Python (3.5), Selenium (3.6) and PhantomJS. I have already tried several approaches, but the webdriver just won't scroll; it only gives me the first page.

1st approach (the usual scrolling approach):

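import time

# Assumes `driver` is an already-created Selenium WebDriver.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Jump to the bottom, then wait for new content to load.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # height did not change, so nothing more was loaded
    last_height = new_height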

2nd approach (just pressing and releasing the down key several times; I tried waiting in between presses, too):

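from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
ActionChains(driver).key_down(Keys.ARROW_DOWN).perform()
ActionChains(driver).key_up(Keys.ARROW_DOWN).perform()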

3rd approach (find the last element in the "scrolling list" and scroll it into view to force scrolling):

posts = driver.find_elements_by_css_selector("div.threadGrid")
driver.execute_script("arguments[0].scrollIntoView();", posts[-1])

Nothing has worked so far. Does anybody know of another approach, or where I made an error?

Answer 1:

To scroll through the webpage until the URL is mydealz.de/?page=3, you can use the following block of code:

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path=r'C:\Utility\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.set_window_size(1400, 1000)
driver.get("https://www.mydealz.de")
# Keep scrolling until the URL reports page 3.
while "page=3" not in driver.current_url:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
print(driver.current_url)
driver.quit()

Console output:

https://www.mydealz.de/?page=3
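
The loop above has no pause and never ends if the URL does not change. A minimal sketch of the same idea with a timeout follows. The names `scroll_until_url_contains` and `FakeDriver` are made up for illustration; the fake only mimics a WebDriver's `current_url` and `execute_script` so the helper can be tried without a browser, and its URL is a placeholder.

```python
import time


def scroll_until_url_contains(driver, fragment, timeout=30.0, pause=0.5):
    """Scroll to the bottom repeatedly until `fragment` appears in the URL.

    Returns True if the fragment showed up, False if `timeout` seconds
    passed first. `driver` is any object with `execute_script` and
    `current_url`, such as a Selenium WebDriver.
    """
    deadline = time.monotonic() + timeout
    while fragment not in driver.current_url:
        if time.monotonic() >= deadline:
            return False
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
    return True


class FakeDriver:
    """Hypothetical stand-in for a WebDriver (illustration only)."""

    def __init__(self, scrolls_per_page=2):
        self.scrolls = 0
        self.scrolls_per_page = scrolls_per_page

    @property
    def current_url(self):
        # Pretend the page number in the URL advances every few scrolls.
        page = 1 + self.scrolls // self.scrolls_per_page
        return "https://example.invalid/?page=%d" % page

    def execute_script(self, script):
        self.scrolls += 1


fake = FakeDriver()
reached = scroll_until_url_contains(fake, "page=3", timeout=5.0, pause=0.0)
print(reached, fake.current_url)
```

With a real driver you would pass the `driver` object from the answer instead of `FakeDriver()` and check the return value before scraping.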

Answer 2:

Easier than going the Selenium/PhantomJS route is to just mimic what the browser does. If you open the "Network" tab in Chrome's Developer Tools, you can see that the browser sends requests to https://www.mydealz.de/?page=2&ajax=true to implement the endless scroll. Copying the request as curl and stripping it down to the minimum gives:

curl 'https://www.mydealz.de/?page=2&ajax=true' -H 'x-requested-with: XMLHttpRequest'

Turning this into a Python script:

import requests

url = 'https://www.mydealz.de/'
headers = {'x-requested-with': 'XMLHttpRequest'}

# Page numbering appears to start at 1 (the browser asks for page=2 next).
for page in range(1, 11):
    params = dict(page=page, ajax='true')
    resp = requests.get(url=url, params=params, headers=headers)
    resp.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    data = resp.json()
    html = data['data']['content']
    # do something with html, maybe parse it with BeautifulSoup

Besides being much simpler code, this will also be a lot faster.
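
As a rough illustration of the "do something with html" step, here is a sketch that pulls link targets out of the returned fragment using only the standard library. The `data` / `content` key path comes from the script above; the sample payload, its markup, and the names `LinkCollector` and `extract_links` are invented for the example, so the real response will look different.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(response_text):
    """Pull the HTML out of the JSON envelope and return its link targets."""
    data = json.loads(response_text)
    html = data["data"]["content"]  # key path taken from the script above
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample response; the real markup will differ.
sample = json.dumps({
    "data": {
        "content": '<div class="threadGrid"><a href="/deals/a-1">A</a></div>'
                   '<div class="threadGrid"><a href="/deals/b-2">B</a></div>'
    }
})
print(extract_links(sample))
```

In the loop above you would call `extract_links(resp.text)` for each page. BeautifulSoup would do the same job with less code if you already have it installed.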

Answer 3:

I can see there are 1853 pages on the site mentioned, so you can run a loop until you reach the last page. The sleep time must be long enough for each page to load; try at least 3 seconds. The higher the value, the lower the chance that the data has not loaded yet.

import time

number_of_scroll = 1857

# Assumes `browser` is an already-created Selenium WebDriver.
while number_of_scroll > 0:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    number_of_scroll -= 1
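
A fixed scroll count can be combined with the height check from the question, so the loop stops early once nothing new loads. This is only a sketch: `scroll_with_limit` and `FakePage` are hypothetical names, and the fake object just imitates a page whose height grows for a few scrolls, so the logic can be tried without a browser.

```python
import time


def scroll_with_limit(driver, max_scrolls, pause=3.0):
    """Scroll at most `max_scrolls` times; stop early once the height stalls.

    Returns the number of scrolls actually performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for done in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return done  # nothing new loaded, so assume the end was reached
        last_height = new_height
    return max_scrolls


class FakePage:
    """Hypothetical stand-in for a WebDriver (illustration only)."""

    def __init__(self, batches):
        self.batches = batches  # how many more scrolls will load content
        self.height = 1000

    def execute_script(self, script):
        if script.startswith("return"):
            return self.height
        if self.batches > 0:
            self.batches -= 1
            self.height += 1000
        return None


print(scroll_with_limit(FakePage(batches=3), max_scrolls=10, pause=0))
```

With a real browser you would call `scroll_with_limit(browser, 1857)`. Note that a slow page can look "stalled" if `pause` is too short, which is the same trade-off the answer describes.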

Source: https://stackoverflow.com/questions/47832281/scrolling-infinite-page-with-python-phantomjs-selenium

Tags

python

selenium

web-scraping

phantomjs